Learning High-Order Interactions via Targeted Pattern Search.

Published on Feb 23, 2021 in arXiv: Learning
Michela Carlotta Massi (Polytechnic University of Milan), estimated H-index: 2
Nicola Rares Franco, estimated H-index: 1
+ 3 authors
Paolo Zunino, estimated H-index: 28
Abstract
Logistic Regression (LR) is a widely used statistical method in empirical binary classification studies. However, real-life scenarios often present complexities that prevent the use of the LR model as is and instead highlight the need to include high-order interactions to capture data variability. This becomes even more challenging because: (i) datasets are growing wider, with more and more variables; (ii) studies are typically conducted in strongly imbalanced settings; (iii) sample sizes range from very large to extremely small; (iv) both predictive models and interpretable results are needed. In this paper we present a novel algorithm, Learning high-order Interactions via targeted Pattern Search (LIPS), to select interaction terms of varying order to include in an LR model for an imbalanced binary classification task when input data are categorical. LIPS's rationale stems from the duality between item sets and categorical interactions. The algorithm relies on an interaction learning step based on a well-known frequent item set mining algorithm, and a novel dissimilarity-based interaction selection step that allows the user to specify the number of interactions to be included in the LR model. In addition, we detail two variants (Scores LIPS and Clusters LIPS) that can address even more specific needs. Through a set of experiments we validate our algorithm and demonstrate its wide applicability to real-life research scenarios, showing that it outperforms a benchmark state-of-the-art algorithm.
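The duality between item sets and categorical interactions mentioned in the abstract can be illustrated with a minimal sketch. This is not the LIPS algorithm: the toy data, the function names (`to_items`, `interaction_column`, `frequent_itemsets`) and the brute-force enumeration are all assumptions made for illustration, and the abstract does not name the frequent item set mining algorithm it actually uses. The sketch only shows that an item set such as {smoker=yes, age=old} corresponds to a binary interaction column that could be added to an LR design matrix.

```python
from itertools import combinations

# Toy categorical dataset (hypothetical values, for illustration only).
rows = [
    {"smoker": "yes", "sex": "F", "age": "old"},
    {"smoker": "yes", "sex": "M", "age": "old"},
    {"smoker": "no", "sex": "F", "age": "young"},
    {"smoker": "yes", "sex": "F", "age": "old"},
]


def to_items(row):
    """View a categorical record as a set of (variable, level) items."""
    return frozenset(row.items())


def interaction_column(rows, itemset):
    """Binary column: 1 where every item of the set is present in the record.

    This equals the product of the one-hot indicators of the items,
    i.e. the interaction term of the corresponding categorical levels.
    """
    return [int(itemset <= to_items(r)) for r in rows]


def frequent_itemsets(rows, min_support, max_order):
    """Brute-force stand-in for a frequent item set miner (assumption)."""
    transactions = [to_items(r) for r in rows]
    items = sorted(set().union(*transactions))
    found = {}
    for order in range(1, max_order + 1):
        for combo in combinations(items, order):
            # Two levels of the same variable can never co-occur: skip them.
            if len({var for var, _ in combo}) < order:
                continue
            itemset = frozenset(combo)
            support = sum(itemset <= t for t in transactions) / len(transactions)
            if support >= min_support:
                found[itemset] = support
    return found


if __name__ == "__main__":
    for itemset, support in frequent_itemsets(rows, 0.5, 3).items():
        print(sorted(itemset), support, interaction_column(rows, itemset))
```

Each returned item set of order k plays the role of a candidate k-way interaction. A selection step, such as the dissimilarity-based one the abstract describes, would then choose which of these columns to include in the model.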