
Concept: K-means clustering

Summary

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found using, for instance, k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, via an iterative refinement approach employed by both algorithms.
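The iterative refinement mentioned above is usually Lloyd's algorithm, which alternates an assignment step and an update step. A minimal numpy sketch, with random initialisation and a simple convergence check (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: a heuristic for the NP-hard k-means objective."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest
        # centroid (squared Euclidean distance, matching the k-means objective).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # guard against empty clusters
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # reached a local optimum
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only decrease the within-cluster sum of squares, so the procedure converges, though only to a local optimum; in practice one runs it with several random restarts and keeps the best solution.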

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts (30)

Cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Principal component analysis

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information.

Unsupervised learning

Unsupervised learning is a paradigm in machine learning where, in contrast to supervised and semi-supervised learning, algorithms learn patterns exclusively from unlabeled data.

Related courses (76)

BIO-322: Introduction to machine learning for bioengineers

Students understand basic concepts and methods of machine learning. They can describe them in mathematical terms and can apply them to data using a high-level programming language (julia/python/R).

DH-406: Machine learning for DH

This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and implement methods to analyze diverse data types, such as images, music and social network data.

MICRO-455: Applied machine learning

Real-world engineering applications must cope with large datasets of dynamic variables, which cannot be well approximated by classical or deterministic models. This course gives an overview of methods from machine learning for the analysis of non-linear, highly noisy and multidimensional data.

Related lectures (175)

This master thesis provides an extended study of winter precipitation in Valais (Switzerland). It was first attempted to relate local conditions in Valais, as measured at gauges, to the Hess-Brezowsky classification for Europe; it was not possible to clearly relate this classification to local winter precipitation events in this region. Self-organizing maps were then used to try to distinguish sub-regions based only on the precipitation regime in Valais; the results obtained are coherent with the existing literature. ERA-Interim (global atmospheric reanalysis) data were then used to develop a new classification linking European meteorological conditions to local events. Three unsupervised clustering techniques were used for this purpose: k-means, k-medoids, and hierarchical clustering. The optimal number of clusters was determined using a combination of four commonly used internal cluster indices (the Dunn, Davies-Bouldin, silhouette, and Calinski-Harabasz indices). Two clusters appeared to be optimal with respect to the various parameters of the problem. The outputs of the three classification techniques were compared in terms of robustness and results, and a physical interpretation was then made. It was found that the two classes were associated with small and large precipitation events, respectively. The associated synoptic conditions at the European scale reveal that the small-precipitation class is associated with a north-east flux over Switzerland, whereas the large-precipitation class is linked to a south-west flux associated with a stationary front over Switzerland.
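Index-based selection of the number of clusters, as used above, can be sketched as follows: cluster synthetic 2-D data for several values of k and keep the k that maximises a hand-rolled Calinski-Harabasz index (one of the four indices mentioned; Dunn, Davies-Bouldin and silhouette are omitted for brevity). The clustering routine, initialisation, and data are illustrative assumptions, not the thesis's actual pipeline:

```python
import numpy as np

def kmeans_ff(X, k, seed=0, n_iter=50):
    # Lloyd iteration with farthest-first initialisation (illustrative sketch).
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Greedily add the point farthest from all centres chosen so far.
        d = ((X[:, None] - np.array(C)[None]) ** 2).sum(-1).min(1)
        C.append(X[d.argmax()])
    C = np.array(C)
    for _ in range(n_iter):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return C, labels

def calinski_harabasz(X, labels, C):
    # Ratio of between-cluster to within-cluster dispersion; higher is better.
    n, k = len(X), len(C)
    overall = X.mean(0)
    B = sum((labels == j).sum() * ((C[j] - overall) ** 2).sum() for j in range(k))
    W = sum(((X[labels == j] - C[j]) ** 2).sum() for j in range(k))
    return (B / (k - 1)) / (W / (n - k))

# Three well-separated synthetic 2-D blobs: the index should peak at k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in (0.0, 4.0, 8.0)])
scores = {}
for k in range(2, 7):
    C, labels = kmeans_ff(X, k)
    scores[k] = calinski_harabasz(X, labels, C)
best_k = max(scores, key=scores.get)
print("best k by Calinski-Harabasz:", best_k)
```

Combining several indices, as the thesis does, hedges against any single criterion's bias; here one index suffices because the blobs are cleanly separated.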

2017

Multimedia databases are growing rapidly in size in the digital age. To increase the value of these data and to enhance the user experience, there is a need to make these videos searchable through automatic indexing. Because people appearing and talking in the videos are often of high interest to end users, indices that represent the location and identity of people in the archive are indispensable for video search and browsing tools. On the other hand, multimedia videos contain rich data about people in both the visual and auditory domains. This offers a potential for multimodal learning in the task of human identification. Hence, the main theme of this thesis is algorithms to create indexes and exploit the audio-visual correspondence in large multimedia corpora based on person identities.
First, this thesis deals with algorithms to create indexes through person discovery in videos. This involves several components: face and speaker diarization, face-voice association, and person naming. To obtain face clusters, we propose a novel face tracking approach that leverages face detectors within a tracking-by-detection framework relying on long-term, time-interval-sensitive association costs. We also use shot context to further accelerate and improve face clustering. Face clusters are then associated with speaker clusters using dubbing and talking detection, in which a multimodal framework is introduced to represent the temporal relationship between the auditory and visual streams. We also improve speaker embeddings for recognition and clustering by using a regularizer called the intra-class loss.
In the second half, the thesis focuses on multimodal learning with face-voice data. Here, we aim to answer two research questions. First, can one improve a voice embedding using knowledge transferred from a face representation? We investigate several transfer-learning approaches to constrain the target voice embedding space to share latent attributes with the source face embedding space. The crossmodal constraints act as regularizers that help voice models, especially in the low-data setting. The second question is: can face clusters be used as training labels to learn a speaker embedding? To answer this, we explore the tolerance of embedding losses under label uncertainty. From the risk-minimization perspective, we obtain analytical results that provide heuristics for improving tolerance against label noise. We apply these findings to our task of learning speaker embeddings using face clusters as labels. While the experimental results agree with the analytical heuristics, there is still a large performance gap between the supervised and weakly supervised models, which requires further investigation in the future.

We study the problem of constructing ε-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An ε-coreset is a weighted subset S ⊆ X with weight function w: S → ℝ≥0 such that, for any k-subset C ⊆ X, it holds that Σ_{x∈S} w(x)·d^z(x, C) ∈ (1 ± ε)·Σ_{x∈X} d^z(x, C). We present an efficient algorithm that constructs an ε-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset depends only on the parameters k, z, ε and the doubling dimension ddim(M). To the best of our knowledge, this is the first efficient ε-coreset construction whose size is independent of |X| for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of M(X, d) and the shattering dimension (or VC dimension) of the range space induced by the distance d. Such a relation was not known before, since one can easily construct instances in which neither quantity can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 ± ε)-distortion of the distance function d (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded by ε^(−O(ddim(M))). For the purpose of coreset construction, this bound does not suffice, as it only works for unweighted spaces. We therefore introduce the notion of τ-error probabilistic shattering dimension and prove a (drastically better) upper bound of O(ddim(M)·log(1/ε) + log log(1/τ)) on the probabilistic shattering dimension for weighted doubling metrics. As it turns out, an upper bound on the probabilistic shattering dimension is enough for constructing a small coreset. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications.
Furthermore, we study robust coresets for (k, z)-clustering with outliers in a doubling metric. We show an improved connection between α-approximation and robust coresets. This also leads to an improvement upon the previously best known bound on the size of robust coresets for Euclidean spaces [Feldman and Langberg, STOC'11]. The new bound entails a few new results in clustering and property testing. As another application, we show that constant-sized (ε, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were known only for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.
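The (1 ± ε) cost-preservation condition in the coreset definition can be illustrated numerically. The sketch below uses plain uniform sampling with weights n/m — this is not the paper's construction and carries no for-all-C guarantee; it only shows what the weighted-cost condition asks for. All names and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, z = 2000, 400, 3, 2        # |X|, sample size, number of centres, cost exponent

X = rng.random((n, 2))                       # points in the unit square
idx = rng.choice(n, size=m, replace=False)   # uniform sample S of X
S, w = X[idx], np.full(m, n / m)             # each sampled point stands in for n/m points

def clustering_cost(P, C, weights=None):
    # sum over P of weight * d(p, C)^z, where d(p, C) = distance to nearest centre
    d = np.sqrt(((P[:, None] - C[None]) ** 2).sum(-1)).min(1)
    w_ = np.ones(len(P)) if weights is None else weights
    return (w_ * d ** z).sum()

ratios = []
for _ in range(5):
    C = rng.random((k, 2))                   # an arbitrary set of k centres
    ratios.append(clustering_cost(S, C, w) / clustering_cost(X, C))
print("weighted-sample / full cost ratios:", [round(r, 3) for r in ratios])
```

For a true ε-coreset the ratio would lie in [1 − ε, 1 + ε] simultaneously for every centre set C, with a sample size independent of |X|; uniform sampling only achieves this on average, which is precisely why the paper needs shattering-dimension arguments.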