Abstract PO-073: Using machine learning to identify the risk factors of pancreatic cancer from the PLCO dataset
Published on Mar 1, 2021 in Clinical Cancer Research
DOI: 10.1158/1557-3265.ADI21-PO-073
Background: Pancreatic cancer (PC) is a disease with a poor prognosis and a low survival rate. There is a pressing need to identify its risk factors. The purpose of this study is to use machine learning methods to identify a subset of factors (a.k.a. features) from the PLCO dataset as predictors of PC. The Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset was collected by the National Cancer Institute from 155,000 participants (49.5% male). Each participant responded to three questionnaires consisting of 65 questions about demographics, illness history, and family background.

Method: This is an optimal feature selection problem. The goal is to identify the subset of features that predicts PC with the highest probability. There are two challenges to solving this problem: (1) the problem is computationally intractable (there are n^65 possible subsets of features, where n is the number of values each feature can take on average), and (2) the PLCO dataset is highly imbalanced (only 0.48% of participants have PC). The dataset was balanced by downsampling the majority class. Eleven methods were used for feature selection. Classification was done by 25 classifiers using the selected features from each of the 11 methods, thereby generating 11 × 25 = 275 results. All methods used for balancing, feature selection, and classification are well established in the field of machine learning. For each classifier, the baseline was obtained by classifying the balanced datasets using all features. The dataset was split 60% for training and 40% for testing. Hyperparameters were estimated via cross-validation on the training set.

Results: Approximately 11% of the 275 classification results were accurate; these were distributed across the different balancing, feature selection, and classification methods. Among the 65 features, 17 were chosen by more than 50% of the feature selection methods.
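The data-handling steps named in the Method (balancing by downsampling the majority class, a 60%/40% train/test split, and keeping features chosen by more than 50% of the selection methods) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the random seeds, and the toy inputs are assumptions made for illustration, and the abstract does not say which feature selection methods or classifiers were used.

```python
import random
from collections import Counter


def downsample_majority(rows, labels, seed=0):
    """Balance a binary dataset by randomly discarding majority-class rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random sample of the majority.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def split_60_40(rows, labels, seed=0):
    """Shuffle the data and split it 60% for training, 40% for testing."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(0.6 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def majority_voted_features(selections):
    """Return features picked by more than 50% of the selection methods.

    `selections` holds one list of chosen feature names per method.
    """
    votes = Counter(f for chosen in selections for f in set(chosen))
    return sorted(f for f, v in votes.items() if v > len(selections) / 2)


if __name__ == "__main__":
    # Toy data (hypothetical): 1000 rows, 5 positives, i.e. heavily imbalanced.
    rows = list(range(1000))
    labels = [1 if i < 5 else 0 for i in rows]
    bal_rows, bal_labels = downsample_majority(rows, labels, seed=1)
    train_x, train_y, test_x, test_y = split_60_40(bal_rows, bal_labels, seed=1)
    print(len(bal_rows), len(train_x), len(test_x))
```

In a setup like the one described, the voting function would receive one list of selected features per feature selection method, and the classifiers would then be trained on the training portion only, with hyperparameters chosen by cross-validation inside that portion.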
Among these 17 features, race, occupation (retired or not, indicative of age), smoking, prior history of any cancer, and number of relatives with PC were more discriminative than the others. Considering subsets of two features, for male participants the probability of PC was highest (0.032) given that the age when told of an inflamed prostate was 70+ and the number of cigarettes smoked daily was 80+, followed by (0.03) the age when told of an inflamed prostate being 70+ together with a prior history of any cancer. For female participants, the probability of PC was highest (0.156) given 2+ relatives with PC and 61-80 cigarettes smoked daily, followed by (0.137) 2+ relatives with PC and 2+ tubal/ectopic pregnancies.

Conclusions: The study found that age, smoking, prior history of cancer, and relatives with cancer are the prominent risk factors of PC. Inflamed prostate for males and tubal/ectopic pregnancies for females are also risk factors of PC. When two of these factors occur in conjunction, the risk of PC may increase further. However, that is not necessarily the case when three or more of these factors occur in conjunction.

Citation Format: Ananya Dutta, Bonny Banerjee, Sheema Khan, Subhash Chauhan. Using machine learning to identify the risk factors of pancreatic cancer from the PLCO dataset [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PO-073.
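The two-feature probabilities reported in the Results can be read as conditional frequencies: among participants who match both feature values, the fraction who have PC. The Python sketch below shows one plausible way to compute such a quantity. It is an assumption that the authors estimated the probabilities this way, and the function name, the field names (`relatives_with_pc`, `cigs_per_day`, `pc`), and the records are hypothetical toy values, not PLCO data.

```python
def conditional_pc_probability(records, conditions, outcome_key="pc"):
    """Estimate P(outcome | conditions) as a simple frequency.

    `records` is a list of dicts, `conditions` maps feature name to the
    required value. Returns None when no record matches the conditions.
    """
    matching = [
        r for r in records
        if all(r.get(name) == value for name, value in conditions.items())
    ]
    if not matching:
        return None
    return sum(1 for r in matching if r[outcome_key]) / len(matching)


if __name__ == "__main__":
    # Hypothetical toy records for illustration only.
    records = [
        {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": True},
        {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": False},
        {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": False},
        {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": False},
        {"relatives_with_pc": "0", "cigs_per_day": "1-10", "pc": False},
    ]
    pair = {"relatives_with_pc": "2+", "cigs_per_day": "61-80"}
    print(conditional_pc_probability(records, pair))
```

With very few matching participants, an estimate of this kind is noisy, which is one reason the risk need not keep rising when three or more factors are combined.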