Improved outlier detection using sparse coding-based methods

Published on May 1, 2019in Pattern Recognition Letters3.255
· DOI :10.1016/J.PATREC.2019.02.022
Jayanta K. Dutta7
Estimated H-index: 7
(Bosch),
Bonny Banerjee11
Estimated H-index: 11
(U of M: University of Memphis)
Sources
Abstract
Abstract Outlier detection is an active area of research in data mining and a large number of algorithms exist. Our goal is to come up with a guideline on how to choose the most appropriate outlier detection algorithm for a given dataset without exploiting any domain- or application-specific information. Extensive experimentations with a number of state-of-the-art algorithms on thousands of benchmark datasets revealed a clear trend. For datasets with low dimensionality and low difficulty level, traditional methods outperform sparse coding-based outlier detection (SCOD) algorithms. But the trend reverses as the dimensionality or difficulty level increases. A threshold emerges as the point of intersection of the trends for SCOD and traditional algorithms, which is 250 and 21 for dimensionality and difficulty level respectively.
📖 Papers frequently viewed together
2 Citations
2013MLDM: Machine Learning and Data Mining in Pattern Recognition
2 Citations
References29
Newest
#1Guojun Gan (UConn: University of Connecticut)H-Index: 17
#2Michael K. Ng (Hong Kong Baptist University)H-Index: 80
We study the problem of data clustering with outlier detection.We propose a k-means-type algorithm by incorporating an additional cluster into the objective function.The algorithm is able to provide data clustering and outlier detection simultaneously.Outliers are not used in the cluster center calculation.Experiments on synthetic and real data show that the algorithm performs well. Outlier detection is an important data analysis task in its own right and removing the outliers from clusters can ...
77 CitationsSource
#1Jayanta K. Dutta (U of M: University of Memphis)H-Index: 7
#2Bonny BanerjeeH-Index: 11
Last. Chandan K. Reddy (WSU: Wayne State University)H-Index: 28
view all 3 authors...
Outlier detection has been an active area of research for a few decades. We propose a new definition of outlier that is useful for high-dimensional data. According to this definition, given a dictionary of atoms learned using the sparse coding objective, the outlierness of a data point depends jointly on two factors: the frequency of each atom in reconstructing all data points (or its negative log activity ratio, NLAR) and the strength by which it is used in reconstructing the current point. A R...
12 CitationsSource
#1Zhenni Li (University of Aizu)H-Index: 11
#2Shuxue Ding (University of Aizu)H-Index: 14
Last. Yujie Li (University of Aizu)H-Index: 5
view all 3 authors...
We present a fast, efficient algorithm for learning an overcomplete dictionary for sparse representation of signals. The whole problem is considered as a minimization of the approximation error function with a coherence penalty for the dictionary atoms and with the sparsity regularization of the coefficient matrix. Because the problem is nonconvex and nonsmooth, this minimization problem cannot be solved efficiently by an ordinary optimization method. We propose a decomposition scheme and an alt...
27 CitationsSource
Jan 25, 2015 in AAAI (National Conference on Artificial Intelligence)
#1Jayanta K. Dutta (U of M: University of Memphis)H-Index: 7
#2Bonny Banerjee (U of M: University of Memphis)H-Index: 11
We present an unsupervised approach for abnormal event detection in videos. We propose, given a dictionary of features learned from local spatiotemporal cuboids using the sparse coding objective, the abnormality of an event depends jointly on two factors: the frequency of each feature in reconstructing all events (or, rarity of a feature) and the strength by which it is used in reconstructing the current event (or, the absolute coefficient). The Incremental Coding Length (ICL) of a feature is a ...
28 Citations
#1Tülin İnkaya (Uludağ University)H-Index: 7
#2Sinan Kayaligil (METU: Middle East Technical University)H-Index: 8
Last. Nur Evin Özdemirel (METU: Middle East Technical University)H-Index: 13
view all 3 authors...
We introduce a novel neighbourhood construction (NC) algorithm.NC algorithm can work in the data sets with no a priori information.NC is parameter-free, and it yields a unique neighbourhood for each data point.We demonstrate the use of NC on clustering and local outlier detection problems. Display Omitted A neighbourhood is a refined group of data points that are locally similar. It should be defined based on the local relations in a data set. However, selection of neighbourhood parameters is an...
22 CitationsSource
#1Bonny Banerjee (U of M: University of Memphis)H-Index: 11
#2Jayanta K. Dutta (U of M: University of Memphis)H-Index: 7
Abstract Sensors that monitor around the clock are everywhere. Due to the sheer amount of data these sensors can generate, the resources required to store, protect personal information, and analyze them are enormous. Since noteworthy events happen only occasionally, it is not necessary to store or analyze the data generated at every instant of time. Rather, it is imperative for a smart memory to learn the norms in such data so that only the abnormal (or salient) events may be stored. We present ...
9 CitationsSource
#1Amir Adler (Technion – Israel Institute of Technology)H-Index: 12
#2Michael Elad (Technion – Israel Institute of Technology)H-Index: 92
Last. Ehud Rivlin (Technion – Israel Institute of Technology)H-Index: 5
view all 4 authors...
We consider the problem of simultaneous sparse coding and anomaly detection in a collection of data vectors. The majority of the data vectors are assumed to conform with a sparse representation model, whereas the anomaly is caused by an unknown subset of the data vectors - the outliers - which significantly deviate from this model. The proposed approach utilizes the Alternating Direction Method of Multipliers (ADMM) to recover simultaneously the sparse representations and the outliers components...
76 CitationsSource
Aug 11, 2013 in KDD (Knowledge Discovery and Data Mining)
#1Andrew Emmott (OSU: Oregon State University)H-Index: 5
#2Shubhomoy Das (OSU: Oregon State University)H-Index: 10
Last. Weng-Keen Wong (OSU: Oregon State University)H-Index: 36
view all 5 authors...
Research in anomaly detection suffers from a lack of realistic and publicly-available problem sets. This paper discusses what properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data sets that vary along three important dimensions: (a) point difficulty, (b) relative frequency of anomalies, and (c) clusteredness. We apply our generate...
101 CitationsSource
Jan 1, 2013 in SDM (SIAM International Conference on Data Mining)
#1Sanjay Chawla (USYD: University of Sydney)H-Index: 37
#2Aristides Gionis (Aalto University)H-Index: 63
128 Citations
#1Arthur Zimek (U of A: University of Alberta)H-Index: 40
#2Erich Schubert (LMU: Ludwig Maximilian University of Munich)H-Index: 23
Last. Hans-Peter Kriegel (LMU: Ludwig Maximilian University of Munich)H-Index: 99
view all 3 authors...
High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so-called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In about just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensi...
425 CitationsSource
Cited By3
Newest
#1Julien LesoupleH-Index: 1
#2Cedric Baudoin (Alenia Aeronautica)H-Index: 10
Last. Jean-Yves Tourneret (University of Toulouse)H-Index: 51
view all 4 authors...
Abstract null null This letter introduces a generalization of Isolation Forest (IF) based on the existing Extended IF (EIF). EIF has shown some interest compared to IF being for instance more robust to some artefacts. However, some information can be lost when computing the EIF trees since the sampled threshold might lead to empty branches. This letter introduces a generalized isolation forest algorithm called Generalized IF (GIF) to overcome these issues. GIF is faster than EIF with a similar p...
Source
#1Yeliz YengiH-Index: 3
#2Adnan KavakH-Index: 13
Last. Huseyin ArslanH-Index: 50
view all 3 authors...
For Long Term Evolution Advanced (LTE-A) network, although there exist many studies that focus on improving the performance with relays, security issues are often neglected. Due to the broadcast nature of wireless channels, relay nodes in LTE-A network may act maliciously, affect communication, reduce quality, and cause delays. Recently, physical (PHY) layer security has attracted researchers to provide secure communication and data privacy. In this study, we propose using unsupervised learning ...
1 CitationsSource
#1Ping Gu (Chongqing University)
#2Meng Chow (Chongqing University)
Last. Si Yu Shao (Chongqing University)
view all 3 authors...
Outlier detection is an important branch in data mining and plays a vital role in broad range of applications including network-traffic anomaly detection, credit fraud prevention, etc. Based on the assumption that dataset can be approximately reconstructed by linear combinations of dictionary atoms, some detection algorithms initially project the data to a higher dimensional manifold such that data representation becomes sparse. Unlike previous sparse coding based approaches, our method SNOD (Sp...
Source