Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such "exemplars" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

Recent mammalian microarray experiments detected widespread transcription and indicated that there may be many undiscovered multiple-exon protein-coding genes. To explore this possibility, we labeled cDNA from unamplified, polyadenylation-selected RNA samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false dete...

Important inference problems in statistical physics, computer vision, error-correcting coding theory, and artificial intelligence can all be reformulated as the computation of marginal probabilities on factor graphs. The belief propagation (BP) algorithm is an efficient way to solve these problems that is exact when the factor graph is a tree, but only approximate when the factor graph has cycles. We show that BP fixed points correspond to the stationary points of the Bethe approximation of the ...

#1Sonia Jain(UCSD: University of California, San Diego)H-Index: 46

#2Radford M. Neal(UCSD: University of California, San Diego)H-Index: 41

This article proposes a split-merge Markov chain algorithm to address the problem of inefficient sampling for conjugate Dirichlet process mixture models. Traditional Markov chain Monte Carlo methods for Bayesian mixture models, such as Gibbs sampling, can become trapped in isolated modes corresponding to an inappropriate clustering of data points. This article describes a Metropolis-Hastings procedure that can escape such local modes by splitting or merging mixture components. Our algorithm empl...

#1Kim D. Pruitt(NIH: National Institutes of Health)H-Index: 36

#2Tatiana Tatusova(NIH: National Institutes of Health)H-Index: 43

Last. Donna Maglott(NIH: National Institutes of Health)H-Index: 46

view all 3 authors...

The goal of the NCBI Reference Sequence (RefSeq) project is to provide the single best non-redundant and comprehensive collection of naturally occurring biological molecules, representing the central dogma. Nucleotide and protein sequences are explicitly linked on a residue-by-residue basis in this collection. Ideally all molecule types will be available for each well-studied organism, but the initial database collection pragmatically includes only those molecules and organisms that are most rea...

Last. Riccardo Zecchina(ICTP: International Centre for Theoretical Physics)H-Index: 51

view all 3 authors...

We study the satisfiability of random Boolean expressions built from many clauses with K variables per clause (K-satisfiability). Expressions with a ratio α of clauses to variables less than a threshold α c are almost always satisfiable, whereas those with a ratio above this threshold are almost always unsatisfiable. We show the existence of an intermediate phase below α c , where the proliferation of metastable states is responsible for the onset of complexity in search algorithms. We introduce...

We present the first constant-factor approximation algorithm for the metric k-median problem. The k-median problem is one of the most well-studied clustering problems, i.e., those problems in which the aim is to partition a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure. For the metric k-median problem, we are given n points in a metric space. We select k of these to be cluster centers and then assign each point to its clos...

Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of "local" functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph, In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the ...

#2Jitendra Malik(University of California, Berkeley)H-Index: 142

We propose a novel approach for solving the perceptual grouping problem in vision. Rather than focusing on local features and their consistencies in the image data, our approach aims at extracting the global impression of an image. We treat image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for segmenting the graph. The normalized cut criterion measures both the total dissimilarity between the different groups as well as the total similar...

Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own imple...

Music recommender systems have become an integral part of music streaming services such as Spotify and Last.fm to assist users navigating the extensive music collections offered by them. However, while music listeners interested in mainstream music are traditionally served well by music recommender systems, users interested in music beyond the mainstream (i.e., non-popular music) rarely receive relevant recommendations. In this paper, we study the characteristics of beyond-mainstream music and m...

Abstract null null Managing risks related to the actions and conditions of the various elements that make up an operating room is a major concern during surgery. Determining alert thresholds is one of the main challenges. In this document, we propose to focus on the causes that lead to incidents as well as their prediction, which are essential elements in the determination of alerts. For that purpose, we have designed an architecture that couples a Multi-Agent System (MAS) with Case-Based Reason...

Abstract null null Scholar performance assessment plays an important role in reward evaluation, funding allocation, and promotion and recruitment decisions. However, raw publication counts and raw citation count-based scholar performance assessment indicators, such as H-index or author citations, have shortcomings; for example, they ignore the impact of different citation patterns under different research topics, leading to authorship credit inflation due to full citation allocation to each auth...

Abstract null null One of the cornerstones of geostatistics is the variogram. Fitting a variogram model to observed data is certainly one of the routine tasks in daily geostatistics practice. The central role of the variogram and the difficulty of performing variogram modeling manually call for an automatic, or at least a computer-assisted, method. Fitting a parametric curve to observational points in an XY chart is far from new and the GSLib software package has included a module (called VARFIT...

Last. Weiyong Liu(USTC: University of Science and Technology of China)

view all 3 authors...

Abstract null null Various studies have focused on the classification of uncertain or imbalanced data. However, previous studies rarely consider the classification for uncertain imbalanced data. To address this research gap, this study proposes an evidential reasoning (ER) based ensemble classifier (EREC). In the proposed method, an aff?nity propagation based oversampling method is developed to obtain the balanced class distributions of the training datasets for individual classifiers. Using the...

Abstract null null A robust bus transit network is of fundamental importance for sustainable development by alleviating urban problems. This paper aims to explore the robustness of 57 bus transit networks from the aspect of transferability. Bus transit networks are constructed using open-source data from the same data source for ensuring a consistent comparison, and the network robustness is analyzed using giant component (GC) to represent the maximum scale of transferability and network efficie...

Abstract null null Hierarchical clustering allows better performance in grouping heterogeneous and non-spherical datasets than the center-based clustering, at the expense of increased time complexity. Meanwhile, the bottom-up approach of hierarchical clustering methods often tend to be sensitive or vulnerable to datasets containing obscure cluster boundaries. This paper presents an effective method for hierarchical clustering, called HCNN, which utilizes two types of structural similarities in n...

In this paper, we propose a novel hierarchical representation via message propagation (HRMP) method for robust model fitting, which simultaneously takes advantages of both the consensus analysis and the preference analysis to estimate the parameters of multiple model instances from data corrupted by outliers, for robust model fitting. Instead of analyzing the information of each data point or each model hypothesis independently, we formulate the consensus information and the preference informati...

Abstract Coalbed methane (CBM) is a clean energy source. The prediction of CBM production is a critical step during CBM exploitation and utilization, especially for geological well selection, engineering decision making, and production management. In past attempts, CBM production prediction methods have been limited to numerical simulation and shallow neural network. Compared with numerical simulation and shallow neural network methods, deep learning has a significant advantage in its ability to...