Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

Published on Feb 7, 2019 in Journal of Chemical Information and Modeling (4.549)
· DOI: 10.1021/ACS.JCIM.8B00663
Angela Lopez-del Rio, Alfons Nonell-Canals, + 1 author, Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia)
Abstract
Binding prediction between targets and drug-like compounds through deep neural networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models remains an open issue. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect the performance of proteochemometric binding prediction models. These strategies are (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on source database, and (4) splitting based on both the clustering and the source database. These schemes are applied to a deep learning proteochemometric model and to a simple logistic regression model used as a baseline. Additionally, two ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification perfo...
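The four splitting strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`random_folds`, `group_folds`), the toy cluster ids and source labels, and the greedy fold-balancing rule are all assumptions made for illustration. In particular, treating strategy (4) as grouping by the (source, cluster) pair is only one plausible reading of the abstract, and the cluster ids are assumed to come from a K-means step that is not shown.

```python
import random
from collections import defaultdict


def random_folds(n_samples, n_folds, seed=0):
    """Strategy 1: assign samples to folds at random, with balanced fold sizes."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [0] * n_samples
    for position, sample in enumerate(order):
        folds[sample] = position % n_folds
    return folds


def group_folds(groups, n_folds):
    """Strategies 2-4: every sample of a group lands in the same fold.

    Groups are placed largest first into the currently smallest fold
    (a simple greedy balancing rule, assumed here for illustration).
    """
    members = defaultdict(list)
    for sample, group in enumerate(groups):
        members[group].append(sample)
    folds = [0] * len(groups)
    sizes = [0] * n_folds
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        target = sizes.index(min(sizes))
        for sample in members[group]:
            folds[sample] = target
        sizes[target] += len(members[group])
    return folds


# Hypothetical toy data: cluster ids (as if produced by K-means) and
# the source database of each compound-target pair.
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
sources = ["A", "A", "A", "B", "B", "B", "C", "C"]

by_random = random_folds(len(clusters), n_folds=2)              # strategy 1
by_cluster = group_folds(clusters, n_folds=2)                   # strategy 2
by_source = group_folds(sources, n_folds=2)                     # strategy 3
by_both = group_folds(list(zip(sources, clusters)), n_folds=2)  # strategy 4 (assumed)

print(by_random, by_cluster, by_source, by_both, sep="\n")
```

With group-based folds, no cluster or source database is shared between the training and validation parts of a split, so performance on the held-out fold reflects generalization to unseen groups rather than to near-duplicates of training samples.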