Clustering by Passing Messages Between Data Points

Published on Feb 16, 2007in Science41.845
· DOI :10.1126/SCIENCE.1136800
Brendan J. Frey74
Estimated H-index: 74
(U of T: University of Toronto),
Delbert Dueck8
Estimated H-index: 8
(U of T: University of Toronto)
Sources
Abstract
Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such "exemplars" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
📖 Papers frequently viewed together
1996KDD: Knowledge Discovery and Data Mining
4 Authors (Martin Ester, ..., Xiaowei Xu)
12.6k Citations
25k Citations
2010ICPR: International Conference on Pattern Recognition
1 Author (Anil K. Jain)
5,315 Citations
References15
Newest
#1Brendan J. Frey (U of T: University of Toronto)H-Index: 74
#2Naveed Mohammad (U of T: University of Toronto)H-Index: 6
Last. Timothy R. Hughes (U of T: University of Toronto)H-Index: 96
view all 14 authors...
Recent mammalian microarray experiments detected widespread transcription and indicated that there may be many undiscovered multiple-exon protein-coding genes. To explore this possibility, we labeled cDNA from unamplified, polyadenylation-selected RNA samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false dete...
46 CitationsSource
#1Jonathan S. Yedidia (Mitsubishi)H-Index: 27
#2William T. FreemanH-Index: 112
Last. Yair WeissH-Index: 65
view all 3 authors...
Important inference problems in statistical physics, computer vision, error-correcting coding theory, and artificial intelligence can all be reformulated as the computation of marginal probabilities on factor graphs. The belief propagation (BP) algorithm is an efficient way to solve these problems that is exact when the factor graph is a tree, but only approximate when the factor graph has cycles. We show that BP fixed points correspond to the stationary points of the Bethe approximation of the ...
1,526 CitationsSource
#1Sonia Jain (UCSD: University of California, San Diego)H-Index: 46
#2Radford M. Neal (UCSD: University of California, San Diego)H-Index: 41
This article proposes a split-merge Markov chain algorithm to address the problem of inefficient sampling for conjugate Dirichlet process mixture models. Traditional Markov chain Monte Carlo methods for Bayesian mixture models, such as Gibbs sampling, can become trapped in isolated modes corresponding to an inappropriate clustering of data points. This article describes a Metropolis-Hastings procedure that can escape such local modes by splitting or merging mixture components. Our algorithm empl...
389 CitationsSource
#1Marc Mézard (University of Paris)H-Index: 79
Problems in computer science, such as error correction in information transfer and "satisfiability" in optimization, show phase transitions familiar from solid-state physics. In his Perspective, MA©zard explains how recent advances in these three fields originate in similar "message passing" procedures. The exchange of elaborate messages between different variables and constraints, used in the study of phase transitions in physical systems, helps to make error correction and satisfiability codes...
40 CitationsSource
#1Kim D. Pruitt (NIH: National Institutes of Health)H-Index: 36
#2Tatiana Tatusova (NIH: National Institutes of Health)H-Index: 43
Last. Donna Maglott (NIH: National Institutes of Health)H-Index: 46
view all 3 authors...
The goal of the NCBI Reference Sequence (RefSeq) project is to provide the single best non-redundant and comprehensive collection of naturally occurring biological molecules, representing the central dogma. Nucleotide and protein sequences are explicitly linked on a residue-by-residue basis in this collection. Ideally all molecule types will be available for each well-studied organism, but the initial database collection pragmatically includes only those molecules and organisms that are most rea...
158 CitationsSource
#1Marc Mézard (University of Paris-Sud)H-Index: 79
#2Giorgio Parisi (University of Paris-Sud)H-Index: 118
Last. Riccardo Zecchina (ICTP: International Centre for Theoretical Physics)H-Index: 51
view all 3 authors...
We study the satisfiability of random Boolean expressions built from many clauses with K variables per clause (K-satisfiability). Expressions with a ratio α of clauses to variables less than a threshold α c are almost always satisfiable, whereas those with a ratio above this threshold are almost always unsatisfiable. We show the existence of an intermediate phase below α c , where the proliferation of metastable states is responsible for the onset of complexity in search algorithms. We introduce...
874 CitationsSource
Aug 1, 2002 in STOC (Symposium on the Theory of Computing)
#1Moses Charikar (Stanford University)H-Index: 65
#2Sudipto Guha (Stanford University)H-Index: 63
Last. David B. Shmoys (Cornell University)H-Index: 67
view all 4 authors...
We present the first constant-factor approximation algorithm for the metric k-median problem. The k-median problem is one of the most well-studied clustering problems, i.e., those problems in which the aim is to partition a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure. For the metric k-median problem, we are given n points in a metric space. We select k of these to be cluster centers and then assign each point to its clos...
464 CitationsSource
#1Frank R. Kschischang (U of T: University of Toronto)H-Index: 60
#2Brendan J. FreyH-Index: 74
Last. Hans-Andrea LoeligerH-Index: 28
view all 3 authors...
Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of "local" functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph, In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the ...
5,637 CitationsSource
#1Jianbo Shi (CMU: Carnegie Mellon University)H-Index: 48
#2Jitendra Malik (University of California, Berkeley)H-Index: 142
We propose a novel approach for solving the perceptual grouping problem in vision. Rather than focusing on local features and their consistencies in the image data, our approach aims at extracting the global impression of an image. We treat image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for segmenting the graph. The normalized cut criterion measures both the total dissimilarity between the different groups as well as the total similar...
12.7k CitationsSource
#1Christopher D. Manning (Stanford University)H-Index: 137
#2Hinrich Schütze (PARC)H-Index: 1
Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own imple...
7,700 Citations
Cited By4912
Newest
#1Dominik KowaldH-Index: 14
#2Peter MuellnerH-Index: 1
Last. Elisabeth LexH-Index: 17
view all 6 authors...
Music recommender systems have become an integral part of music streaming services such as Spotify and Last.fm to assist users navigating the extensive music collections offered by them. However, while music listeners interested in mainstream music are traditionally served well by music recommender systems, users interested in music beyond the mainstream (i.e., non-popular music) rarely receive relevant recommendations. In this paper, we study the characteristics of beyond-mainstream music and m...
1 CitationsSource
#1Bruno Perez (CNRS: Centre national de la recherche scientifique)H-Index: 1
#1Bruno Perez (CNRS: Centre national de la recherche scientifique)H-Index: 1
Last. Frédéric AuberH-Index: 9
view all 5 authors...
Abstract null null Managing risks related to the actions and conditions of the various elements that make up an operating room is a major concern during surgery. Determining alert thresholds is one of the main challenges. In this document, we propose to focus on the causes that lead to incidents as well as their prediction, which are essential elements in the determination of alerts. For that purpose, we have designed an architecture that couples a Multi-Agent System (MAS) with Case-Based Reason...
Source
#1Qing Xie (XTU: Xiangtan University)H-Index: 2
#2Xinyuan Zhang (Zhengzhou University)H-Index: 2
Last. Min Song (Yonsei University)H-Index: 25
view all 3 authors...
Abstract null null Scholar performance assessment plays an important role in reward evaluation, funding allocation, and promotion and recruitment decisions. However, raw publication counts and raw citation count-based scholar performance assessment indicators, such as H-index or author citations, have shortcomings; for example, they ignore the impact of different citation patterns under different research topics, leading to authorship credit inflation due to full citation allocation to each auth...
Source
#1Paulo M. Carvalho (Petrobras)H-Index: 6
#2João Felipe Coimbra Leite Costa (UFRGS: Universidade Federal do Rio Grande do Sul)H-Index: 12
Abstract null null One of the cornerstones of geostatistics is the variogram. Fitting a variogram model to observed data is certainly one of the routine tasks in daily geostatistics practice. The central role of the variogram and the difficulty of performing variogram modeling manually call for an automatic, or at least a computer-assisted, method. Fitting a parametric curve to observational points in an XY chart is far from new and the GSLib software package has included a module (called VARFIT...
Source
#1Chao Fu (Hefei University of Technology)H-Index: 16
#2Qianshan Zhan (Hefei University of Technology)
Last. Weiyong Liu (USTC: University of Science and Technology of China)
view all 3 authors...
Abstract null null Various studies have focused on the classification of uncertain or imbalanced data. However, previous studies rarely consider the classification for uncertain imbalanced data. To address this research gap, this study proposes an evidential reasoning (ER) based ensemble classifier (EREC). In the proposed method, an aff?nity propagation based oversampling method is developed to obtain the balanced class distributions of the training datasets for individual classifiers. Using the...
Source
Abstract null null A robust bus transit network is of fundamental importance for sustainable development by alleviating urban problems. This paper aims to explore the robustness of 57 bus transit networks from the aspect of transferability. Bus transit networks are constructed using open-source data from the same data source for ensuring a consistent comparison, and the network robustness is analyzed using giant component (GC) to represent the maximum scale of transferability and network efficie...
Source
#1Sercan Ozcan (HSE: National Research University – Higher School of Economics)H-Index: 4
#2C. Okan Sakar (BAU: Bahçeşehir University)H-Index: 10
Last. Metin Suloglu (University of Leeds)H-Index: 1
view all 3 authors...
1 CitationsSource
#1Chunrong Wu (Chongqing University)H-Index: 4
#2Qinglan Peng (Chongqing University)H-Index: 4
Last. Yunni Xia (Chongqing University)H-Index: 25
view all 5 authors...
Abstract null null Hierarchical clustering allows better performance in grouping heterogeneous and non-spherical datasets than the center-based clustering, at the expense of increased time complexity. Meanwhile, the bottom-up approach of hierarchical clustering methods often tend to be sensitive or vulnerable to datasets containing obscure cluster boundaries. This paper presents an effective method for hierarchical clustering, called HCNN, which utilizes two types of structural similarities in n...
Source
In this paper, we propose a novel hierarchical representation via message propagation (HRMP) method for robust model fitting, which simultaneously takes advantages of both the consensus analysis and the preference analysis to estimate the parameters of multiple model instances from data corrupted by outliers, for robust model fitting. Instead of analyzing the information of each data point or each model hypothesis independently, we formulate the consensus information and the preference informati...
Source
#1Zixi Guo (UQ: University of Queensland)
#2Jinzhou Zhao (SWPU: Southwest Petroleum University)
Last. Yiyu Chen (PetroChina)
view all 6 authors...
Abstract Coalbed methane (CBM) is a clean energy source. The prediction of CBM production is a critical step during CBM exploitation and utilization, especially for geological well selection, engineering decision making, and production management. In past attempts, CBM production prediction methods have been limited to numerical simulation and shallow neural network. Compared with numerical simulation and shallow neural network methods, deep learning has a significant advantage in its ability to...
Source