The effect of statistical normalization on network propagation scores.

Published on May 5, 2021 in Bioinformatics · DOI: 10.1093/BIOINFORMATICS/BTAA896
Sergio Picart-Armada (UPC: Polytechnic University of Catalonia), estimated H-index: 5
Wesley K. Thompson, estimated H-index: 68
+ 1 author
Alexandre Perera-Lluna (UPC: Polytechnic University of Catalonia), estimated H-index: 12
Abstract
MOTIVATION: Network diffusion and label propagation are fundamental tools in computational biology, with applications such as gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties and the presence of bias of such diffusion processes in each of their applications. In this work, we characterized some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels.
RESULTS: Diffusion scores starting from binary labels were affected by the label codification and exhibited a problem-dependent topological bias that could be removed by statistical normalization. Parametric and non-parametric normalization addressed both points by being codification-independent and by equalizing the bias. We identified and quantified two sources of bias, mean value and variance, that yielded performance differences when normalizing the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalization was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
AVAILABILITY: The code is publicly available at https://github.com/b2slab/diffuBench and the data underlying this article are available at https://github.com/b2slab/retroData.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
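The normalization idea in the abstract can be illustrated with a minimal sketch. It assumes a regularized Laplacian kernel K = (I + L)^-1 and a null model in which the input labels are permuted uniformly among the labelled nodes. The function names, the toy graph and the choice of kernel are hypothetical and are not taken from diffuBench. The closed-form moments below follow from standard results on sampling without replacement and may differ in form from the formulae derived in the paper.

```python
from itertools import combinations

import numpy as np


def regularized_laplacian_kernel(adj):
    """Return K = (I + L)^-1 for the unnormalized Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.inv(np.eye(len(adj)) + lap)


def null_moments(K, y):
    """Exact mean and variance of K @ y when the entries of y are permuted.

    K has one row per scored node and one column per labelled node.
    Under a uniform permutation, each label has mean y.mean(), variance
    y.var() and pairwise covariance -y.var() / (m - 1).
    """
    y = np.asarray(y, dtype=float)
    m = K.shape[1]
    row_sum = K.sum(axis=1)
    row_sq = (K ** 2).sum(axis=1)
    mean = y.mean() * row_sum
    var = y.var() * m / (m - 1) * (row_sq - row_sum ** 2 / m)
    return mean, var


def z_scores(K, y):
    """Parametric normalization: center and scale raw scores by the null."""
    y = np.asarray(y, dtype=float)
    mean, var = null_moments(K, y)
    return (K @ y - mean) / np.sqrt(var)


def enumerated_moments(K, n_pos):
    """Brute-force null for binary labels: try every placement of n_pos ones."""
    m = K.shape[1]
    null = np.array([K[:, list(idx)].sum(axis=1)
                     for idx in combinations(range(m), n_pos)])
    return null.mean(axis=0), null.var(axis=0)


# Toy graph (assumption): node 0 is a hub, nodes 3-4-5 form a tail.
edges = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

K_full = regularized_laplacian_kernel(adj)
labelled = [0, 1, 3, 4, 5]          # node 2 carries no label
K = K_full[:, labelled]             # rows: all nodes, columns: labelled nodes
y = np.array([0, 1, 0, 0, 1])       # binary labels on the labelled nodes

raw = K @ y                         # raw diffusion scores
z = z_scores(K, y)                  # normalized scores
null_mean, null_var = null_moments(K, y)

print("raw scores:", np.round(raw, 3))
print("null mean :", np.round(null_mean, 3))
print("null var  :", np.round(null_var, 4))
print("z-scores  :", np.round(z, 3))
```

In this sketch the null mean and variance differ from node to node, which is one way to see the two bias sources named in the abstract. Recoding the labels from 0/1 to -1/+1 changes the raw scores but leaves the z-scores unchanged, since the normalization is invariant to positive affine transformations of the labels. A Monte Carlo permutation estimate could replace the closed form when the null model has no tractable moments.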