Benchmarking network propagation methods for disease gene identification

Published on Sep 3, 2019 in PLOS Computational Biology (4.475)
DOI: 10.1371/JOURNAL.PCBI.1007276
Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), estimated H-index: 6
Steven J. Barrett, estimated H-index: 2
+ 3 authors
Benoit H. Dessailly, estimated H-index: 18
Abstract
In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks, six performance metrics and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes.
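The ideas in the abstract (propagating seed genes over a network, keeping protein complexes together during cross-validation, and counting true hits among the top suggestions) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy network, the gene names, and the function names are hypothetical; random walk with restart is used as a generic stand-in for network propagation and is not claimed to be one of the 12 benchmarked algorithms; the fold assignment is one simple way to keep complexes intact and may differ from the two schemes introduced in the paper.

```python
from collections import defaultdict


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by a random walk that restarts at the seed genes."""
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        updated = {n: restart * start[n] for n in nodes}
        for node in nodes:
            flow = (1.0 - restart) * scores[node]
            if adj[node]:
                for neighbour in adj[node]:
                    updated[neighbour] += flow / len(adj[node])
            else:
                updated[node] += flow  # isolated node keeps its own mass
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores


def hits_in_top_k(scores, positives, exclude, k=20):
    """Count known positives among the k best-scored non-seed genes."""
    ranked = sorted((n for n in scores if n not in exclude),
                    key=lambda n: (-scores[n], n))
    return sum(1 for n in ranked[:k] if n in positives)


def complex_aware_folds(genes, complexes, n_folds):
    """Assign genes to folds so that members of a complex share a fold."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    # Merge genes that share a complex (overlapping complexes are merged too).
    for members in complexes:
        members = [m for m in members if m in parent]
        for m in members[1:]:
            parent[find(m)] = find(members[0])

    groups = defaultdict(list)
    for g in genes:
        groups[find(g)].append(g)

    # Largest groups first, each placed in the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(groups.values(), key=lambda grp: (-len(grp), grp[0])):
        min(folds, key=len).extend(group)
    return folds


# Hypothetical toy network: five genes, undirected edges.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
network = defaultdict(set)
for u, v in edges:
    network[u].add(v)
    network[v].add(u)

seeds = {"A"}
scores = random_walk_with_restart(network, seeds)
top2_hits = hits_in_top_k(scores, positives={"B", "D"}, exclude=seeds, k=2)

# Hypothetical complexes: g1-g2 and g2-g3 overlap, so g1, g2, g3 stay together.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
complexes = [["g1", "g2"], ["g2", "g3"]]
folds = complex_aware_folds(genes, complexes, n_folds=2)
```

Splitting at the level of complexes rather than single genes means a held-out target cannot be recovered merely because another member of its complex sits in the training set, which is the source of over-optimism the abstract attributes to standard cross-validation.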
References (63)
#1 Samir Kanaan-Izquierdo (UPC: Polytechnic University of Catalonia), H-Index: 3
#2 Andrey Ziyatdinov (Harvard University), H-Index: 11
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(4 authors in total)
Multiview datasets are the norm in bioinformatics, often under the label multi-omics. Multiview data is gathered from several experiments, measurements or feature sets available for the same subjects. Recent studies in pattern recognition have shown the advantage of using multiview methods of clustering and dimensionality reduction; however, none of these methods are readily available to the extent of our knowledge. Multiview extensions of four well-known pattern recognition methods are proposed...

#1 Shayan Tabe-Bordbar (UIUC: University of Illinois at Urbana–Champaign), H-Index: 4
#2 Amin Emad (UIUC: University of Illinois at Urbana–Champaign), H-Index: 10
Last. Saurabh Sinha (UIUC: University of Illinois at Urbana–Champaign), H-Index: 49
(4 authors in total)
Cross-validation (CV) is a technique to assess the generalizability of a model to unseen data. This technique relies on assumptions that may not be satisfied when studying genomics datasets. For example, random CV (RCV) assumes that a randomly selected set of samples, the test set, well represents unseen data. This assumption doesn't hold true where samples are obtained from different experimental conditions, and the goal is to learn regulatory relationships among the genes that generalize beyon...

#1 Justin K. Huang (UCSD: University of California, San Diego), H-Index: 9
#2 Daniel E. Carlin (UCSD: University of California, San Diego), H-Index: 14
Last. Trey Ideker (UCSD: University of California, San Diego), H-Index: 98
(7 authors in total)
Summary: Gene networks are rapidly growing in size and number, raising the question of which networks are most appropriate for particular applications. Here, we evaluate 21 human genome-wide interaction networks for their ability to recover 446 disease gene sets identified through literature curation, gene expression profiling, or genome-wide association studies. While all networks have some ability to recover disease genes, we observe a wide range of performance with STRING, ConsensusPathDB, and...

#1 Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), H-Index: 6
#2 Wesley K. Thompson (UCSD: University of California, San Diego), H-Index: 70
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(4 authors in total)
This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record Sergio Picart-Armada, Wesley K Thompson, Alfonso Buil, Alexandre Perera-Lluna; diffuStats: an R package to compute diffusion-based scores on biological networks, Bioinformatics, Volume 34, Issue 3, 1 February 2018, Pages 533–534 is available online at: https://doi.org/10.1093/bioinformatics/btx632.

#1 Samir Kanaan-Izquierdo (UPC: Polytechnic University of Catalonia), H-Index: 3
#2 Andrey Ziyatdinov (Harvard University), H-Index: 11
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(3 authors in total)
Abstract: An ever-increasing number of data analysis problems include more than one view of the data, i.e. different measurement approaches to the population under study. In consequence, pattern analysis methods that deal appropriately with multiview data are becoming increasingly useful. In this paper, a novel multiview spectral clustering algorithm is presented (multiview spectral clustering by common eigenvectors, or MVSC-CEV), based on computing the common eigenvectors of the Laplacian matric...

#1 Bram Verstockt (University of Cambridge), H-Index: 21
#2 Kenneth G. C. Smith (University of Cambridge), H-Index: 88
Last. James Lee (University of Cambridge), H-Index: 65
(3 authors in total)
Over the course of the past decade, genome-wide association studies (GWAS) have revolutionised our understanding of complex disease genetics. One of the diseases that has benefitted most from this technology has been Crohn's disease (CD), with the identification of autophagy, the IL-17/IL-23 axis and innate lymphoid cells as key players in CD pathogenesis. Our increasing understanding of the genetic architecture of CD has also highlighted how a failure to suppress aberrant immune responses may...

#1 Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), H-Index: 6
#2 Francesc Fernandez-Albert (UPC: Polytechnic University of Catalonia), H-Index: 7
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(8 authors in total)
Metabolomics experiments identify metabolites whose abundance varies as the conditions under study change. Pathway enrichment tools help in the identification of key metabolic processes and in building a plausible biological explanation for these variations. Although several methods are available for pathway enrichment using experimental evidence, metabolomics does not yet have a comprehensive overview in a network layout at multiple molecular levels. We propose a novel pathway enrichment proced...

#1 Amira Al-Aamri (Khalifa University), H-Index: 3
#2 Kamal Taha (Khalifa University), H-Index: 14
Last. Dirar Homouz (Khalifa University), H-Index: 16
(5 authors in total)
Text mining has become an important tool in bioinformatics research with the massive growth in the biomedical literature over the past decade. Mining the biomedical literature has resulted in an incredible number of computational algorithms that assist many bioinformatics researchers. In this paper, we present a text mining system called Gene Interaction Rare Event Miner (GIREM) that constructs gene-gene-interaction networks for the human genome using information extracted from biomedical literature...

#1 Evan A. Boyle (Stanford University), H-Index: 25
#2 Yang I. Li (Stanford University), H-Index: 30
Last. Jonathan K. Pritchard (Stanford University), H-Index: 97
(3 authors in total)
A central goal of genetics is to understand the links between genetic variation and disease. Intuitively, one might expect disease-causing variants to cluster into key pathways that drive disease etiology. But for complex traits, association signals tend to be spread across most of the genome, including near many genes without an obvious connection to disease. We propose that gene regulatory networks are sufficiently interconnected such that all genes expressed in disease-relevant cells are liabl...

#1 Biaobin Jiang (Purdue University), H-Index: 6
#2 Kyle Kloster (Purdue University), H-Index: 7
Last. Michael Gribskov (Purdue University), H-Index: 45
(4 authors in total)
Motivation: Diffusion-based network models are widely used for protein function prediction using protein network data and have been shown to outperform neighborhood-based and module-based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a func...

Cited By (19)
#1 Sarah C. Brüningk (ETH Zurich), H-Index: 5
Last. Heiko Enderling (USF: University of South Florida), H-Index: 29
(7 authors in total)
Recurrent high grade glioma patients face a poor prognosis for which no curative treatment option currently exists. In contrast to prescribing high dose hypofractionated stereotactic radiotherapy (HFSRT, [Formula: see text] Gy [Formula: see text] 5 in daily fractions) with debulking intent, we suggest a personalized treatment strategy to improve tumor control by delivering high dose intermittent radiation treatment (iRT, [Formula: see text] Gy [Formula: see text] 1 every 6 weeks). We performed a...

#1 S. Dotolo (UNISA: University of Salerno), H-Index: 2
#2 Anna Marabotti (UNISA: University of Salerno), H-Index: 23
Last. Roberto Tagliaferri (UNISA: University of Salerno), H-Index: 26
(10 authors in total)
Motivation: Assessment of genetic mutations is an essential element in the modern era of personalized cancer treatment. Our strategy is focused on 'multiple network analysis' in which we try to improve cancer diagnostics by using biological networks. Genetic alterations in some important hubs or in driver genes such as BRAF and TP53 play a critical role in regulating many important molecular processes. Most of the studies are focused on the analysis of the effects of single mutations, while tumor...

#1 Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), H-Index: 6
#2 Wesley K. Thompson, H-Index: 70
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(4 authors in total)
Motivation: Network diffusion and label propagation are fundamental tools in computational biology, with applications like gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This opens the question of the statistical properties and the presence of bias of such diffusion processes in each of its applicati...

#1 Angela Lopez-del Rio (UPC: Polytechnic University of Catalonia), H-Index: 2
#2 Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), H-Index: 6
Last. Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), H-Index: 13
(3 authors in total)
In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep l...

#2 Apichat Suratanee (King Mongkut's University of Technology North Bangkok), H-Index: 8
#3 Kitiporn Plaimas (Chula: Chulalongkorn University), H-Index: 8
Disease-related gene prioritization is one of the most well-established pharmaceutical techniques used to identify genes that are important to a biological process relevant to a disease. In identifying these essential genes, the network diffusion (ND) approach is a widely used technique applied in gene prioritization. However, there is still a large number of candidate genes that need to be evaluated experimentally. Therefore, it would be of great value to develop a new strategy to improve the p...

#1 David Ochoa (EMBL-EBI: European Bioinformatics Institute), H-Index: 13
#2 Andrew Hercules (EMBL-EBI: European Bioinformatics Institute), H-Index: 5
Last. Ian Dunham (EMBL-EBI: European Bioinformatics Institute), H-Index: 73
(32 authors in total)
The Open Targets Platform (https://www.targetvalidation.org/) provides users with a queryable knowledgebase and user interface to aid systematic target identification and prioritisation for drug discovery based upon underlying evidence. It is publicly available and the underlying code is open source. Since our last update two years ago, we have had 10 releases to maintain and continuously improve evidence for target-disease relationships from 20 different data sources. In addition, we have integ...

#1 Amanda Fernández-Fontelo (Humboldt University of Berlin), H-Index: 6
#2 David Moriña, H-Index: 14
Last. Pere Puig (Autonomous University of Barcelona), H-Index: 60
(5 authors in total)
The present paper introduces a new model used to study and analyse the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) epidemic-reported-data from Spain. This is a Hidden Markov Model whose hidden layer is a regeneration process with Poisson immigration, Po-INAR(1), together with a mechanism that allows the estimation of the under-reporting in non-stationary count time series. A novelty of the model is that the expectation of the innovations in the unobserved process is a time-depend...

#1 Rhea Mary Josi (SRM University)
#2 R. I. Minu (SRM University)
Parkinson's disease is a chronic neuro-degenerative disease that affects the central nervous system. Since the causes of the disease are still unknown, both genetic and environmental factors are believed to be involved. Therefore, the identification and prediction of this genetic disorder plays an important role in early detection and treatment. The genomic data is extremely large and distorted, so the use of R tools can make the analysis and pre-processing of data easier. There are several com...

#1 Aidan MacNamara, H-Index: 11
#2 Nikolina Nakic, H-Index: 7
Last. Alex Gutteridge, H-Index: 25
(7 authors in total)
Genetic evidence of disease association has often been used as a basis for selecting drug targets for complex common diseases. Likewise, the propagation of genetic evidence through gene or protein interaction networks has been shown to accurately infer novel disease associations at genes for which no direct genetic evidence can be observed. However, an empirical test of the utility of combining these approaches for drug discovery has been lacking. In this study, we examine genetic association...

#1 Angela Lopez-del Rio (UPC: Polytechnic University of Catalonia), H-Index: 2
#2 Maria Jesus Martin (EMBL-EBI: European Bioinformatics Institute), H-Index: 44
Last. Rabie Saidi (EMBL-EBI: European Bioinformatics Institute), H-Index: 12
(4 authors in total)
The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is yet unknown. We p...