Data mining application to healthcare fraud detection: a two-step unsupervised clustering method for outlier detection with administrative databases.

Published on Jul 14, 2020 in BMC Medical Informatics and Decision Making 2.317
DOI: 10.1186/S12911-020-01143-9
Michela Carlotta Massi (Polytechnic University of Milan), estimated H-index: 2
Francesca Ieva (Polytechnic University of Milan), estimated H-index: 14
Emanuele Lettieri (Polytechnic University of Milan), estimated H-index: 19
BACKGROUND The healthcare sector is an attractive target for fraudsters. The large amount of available data makes it possible to tackle this issue with data mining techniques, making the auditing process more efficient and effective. This research aims to develop a novel data mining model for fraud detection among hospitals using Hospital Discharge Charts (HDC) in administrative databases. In particular, it focuses on DRG (Diagnosis-Related Group) upcoding, i.e., the tendency to register codes for provided services and inpatients' health status so as to make the hospitalization fall within a more remunerative DRG class.
METHODS We propose a two-step algorithm. The first step entails k-means clustering of providers to identify locally consistent and locally similar groups of hospitals, according to their characteristics and behavior in treating a specific disease, in order to spot outliers within these groups of peers. An initial grid search for the best number of features to be selected (through Principal Feature Analysis) and the best number of local groups makes the algorithm extremely flexible. In the second step, we propose a human-decision support system that helps auditors cross-validate the identified outliers by analyzing them with respect to fraud-related variables and the complexity of the patient case mix they treated. The proposed algorithm was tested on a database of HDC collected by Regione Lombardia (Italy) over three years (2013-2015), focusing on the treatment of Heart Failure.
RESULTS The model identified 6 clusters of hospitals and 10 outliers among the 183 units. Among those providers, we report in depth the application of Step Two to three hospitals (two private and one public). After cross-validation against the patient population and the hospitals' characteristics, the public hospital's outlier status appeared justified, while the two private providers were deemed worth further investigation by auditors.
CONCLUSIONS The proposed model is promising in identifying anomalous DRG coding behavior and is easily transferable to all diseases and contexts of interest. Our proposal contributes to the limited literature on behavioral models for fraud detection, identifying the most 'cautious' fraudsters. Together, the results of the first and second steps represent a valuable set of information for auditors in their preliminary investigation.
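The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how outliers are scored within a cluster, so the robust z-score on distance to the centroid, the cutoff `z_cut`, the farthest-first initialization, the toy feature values, and the helper names `kmeans`, `flag_outliers`, and `peer_report` are all assumptions made for illustration. Principal Feature Analysis and the grid search over the number of clusters are omitted.

```python
import math
import statistics

def kmeans(points, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-first initialization (an assumption)."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(tuple(max(points, key=lambda p: min(math.dist(p, c) for c in centroids))))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each hospital joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

def flag_outliers(points, centroids, labels, z_cut=3.0):
    """Step one (assumed rule): flag hospitals unusually far from their own peer group."""
    dists = [math.dist(p, centroids[l]) for p, l in zip(points, labels)]
    flagged = []
    for c in range(len(centroids)):
        idx = [i for i, l in enumerate(labels) if l == c]
        if len(idx) < 3:
            continue  # too few peers to judge
        d = [dists[i] for i in idx]
        med = statistics.median(d)
        mad = statistics.median([abs(x - med) for x in d])
        if mad == 0:
            continue
        flagged += [i for i in idx if (dists[i] - med) / mad > z_cut]
    return sorted(flagged)

def peer_report(points, labels, flagged):
    """Step two (stand-in): show each flagged hospital next to its peer-group medians."""
    report = {}
    for i in flagged:
        peers = [p for p, l in zip(points, labels) if l == labels[i]]
        medians = tuple(statistics.median(col) for col in zip(*peers))
        report[i] = {"values": points[i], "peer_medians": medians}
    return report

# Toy, already-standardized features per hospital (hypothetical): two peer groups
# plus one hospital (index 16) that sits near group A but behaves differently.
offsets = [(-0.2, -0.2), (-0.2, 0.2), (0.2, -0.2), (0.2, 0.2),
           (0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (0.0, -0.1)]
points = offsets + [(10 + x, 10 + y) for x, y in offsets] + [(0.0, 3.0)]

centroids, labels = kmeans(points, k=2)
flagged = flag_outliers(points, centroids, labels)
print(flagged)
print(peer_report(points, labels, flagged))
```

Scoring each hospital against its own cluster rather than against the whole population mirrors the abstract's emphasis on comparing providers with their peers. The report is only a starting point for an auditor, who would still check the flagged provider against fraud-related variables and case-mix complexity.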