Résumé
Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan. Given a set of samples represented by an -dimensional feature vector, the entire dataset can be represented as rows in columns (i.e., an matrix). The Biclustering algorithm generates Biclusters. A Bicluster is a subset of rows which exhibit similar behavior across a subset of columns, or vice versa. Biclustering was originally introduced by John A. Hartigan in 1972. The term "Biclustering" was then later used and refined by Boris G. Mirkin. This algorithm was not generalized until 2000, when Y. Cheng and George M. Church proposed a biclustering algorithm based on variance and applied it to biological gene expression data. In 2001 and 2003, I. S. Dhillon published two algorithms applying biclustering to files and words. One version was based on bipartite spectral graph partitioning. The other was based on information theory. Dhillon assumed the loss of mutual information during biclustering was equal to the Kullback–Leibler-distance (KL-distance) between P and Q. P represents the distribution of files and feature words before Biclustering, while Q is the distribution after Biclustering. KL-distance is for measuring the difference between two random distributions. KL = 0 when the two distributions are the same and KL increases as the difference increases. Thus, the aim of the algorithm was to find the minimum KL-distance between P and Q. In 2004, Arindam Banerjee used a weighted-Bregman distance instead of KL-distance to design a Biclustering algorithm that was suitable for any kind of matrix, unlike the KL-distance algorithm. To cluster more than two types of objects, in 2005, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.
À propos de ce résultat
Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.