# Hierarchical clustering

Summary

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:
Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. On the other hand, except for the special case of single-linkage distance, none of the algorithms (except exhaustive search in $\mathcal{O}(2^n)$ time) can be guaranteed to find the optimum solution.
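
The bottom-up strategy, working from nothing but a matrix of distances, can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes single linkage, and the function name `agglomerate` and the 4x4 matrix `D` are hypothetical.

```python
from itertools import combinations


def agglomerate(dist):
    """Naive single-linkage agglomerative clustering on a square distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a sorted tuple of observation indices.
    """
    # Each observation starts in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: pick the closest pair of clusters. Under single
        # linkage, the distance between two clusters is the smallest
        # distance between any two of their members.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist[i][j] for i in a for j in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical distance matrix for four observations (illustration only).
D = [
    [0, 1, 6, 8],
    [1, 0, 5, 7],
    [6, 5, 0, 2],
    [8, 7, 2, 0],
]
merges = agglomerate(D)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The ordered list of merges and their distances is the information a dendrogram displays.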
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of $\mathcal{O}(n^3)$ and requires $\Omega(n^2)$ memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity $\mathcal{O}(n^2)$) are known: SLINK for single-linkage and CLINK for complete-linkage clustering. With a heap, the runtime of the general case can be reduced to $\mathcal{O}(n^2 \log n)$, an improvement on the aforementioned bound of $\mathcal{O}(n^3)$, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.
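
A rough operation count shows where a cubic running time for the naive algorithm comes from. It assumes that the algorithm rescans every remaining pair of clusters before each merge and that each pairwise distance is available in constant time from a stored matrix. With $k$ clusters remaining there are $\binom{k}{2}$ pairs to inspect, and $k$ runs from $n$ down to $2$:

```latex
\sum_{k=2}^{n} \binom{k}{2} \;=\; \binom{n+1}{3} \;=\; \frac{(n+1)\,n\,(n-1)}{6} \;=\; \mathcal{O}(n^3)
```

The stored matrix itself holds $\binom{n}{2} = n(n-1)/2$ distances, which accounts for the quadratic memory requirement.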
Divisive clustering with an exhaustive search is $\mathcal{O}(2^n)$, but it is common to use faster heuristics to choose splits, such as k-means.
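
As an illustration of the heuristic route, the sketch below bisects a cluster recursively with a tiny 2-means loop. Several things here are assumptions made for brevity: the data are one-dimensional, the two centres are initialised at the extremes so the result is deterministic, the loop runs a fixed number of iterations, and the names `two_means` and `divisive` are hypothetical.

```python
def two_means(points, iters=20):
    """Split 1-D points into two groups with a small 2-means (Lloyd) loop.

    Assumes the points are not all identical, so both groups stay non-empty.
    """
    # Deterministic initialisation (an assumption): start from the extremes.
    c0, c1 = min(points), max(points)
    left, right = list(points), []
    for _ in range(iters):
        # Assign each point to the nearer centre, then recompute the centres.
        left = [p for p in points if abs(p - c0) <= abs(p - c1)]
        right = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return left, right


def divisive(points):
    """Top-down clustering: return a nested tuple tree whose leaves are lists."""
    # Stop when a cluster cannot be split any further.
    if len(points) < 2 or min(points) == max(points):
        return sorted(points)
    left, right = two_means(points)
    return (divisive(left), divisive(right))


tree = divisive([1.0, 1.2, 5.0, 5.1, 9.0])
print(tree)
```

Each call makes one greedy split and never revisits it, which is why this approach is fast but not guaranteed to find the optimum hierarchy.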
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required.
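
Three commonly used set-level dissimilarities can be written directly on top of a distance matrix. Single and complete linkage are named in the text above; average linkage is added here as a further common choice. The function names and the matrix `D` are hypothetical, and this is a sketch rather than any library's API.

```python
def single_linkage(dist, a, b):
    """Smallest distance between a member of a and a member of b."""
    return min(dist[i][j] for i in a for j in b)


def complete_linkage(dist, a, b):
    """Largest distance between a member of a and a member of b."""
    return max(dist[i][j] for i in a for j in b)


def average_linkage(dist, a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


# Hypothetical distance matrix for four observations (illustration only).
D = [
    [0, 1, 6, 8],
    [1, 0, 5, 7],
    [6, 5, 0, 2],
    [8, 7, 2, 0],
]
a, b = (0, 1), (2, 3)
print(single_linkage(D, a, b), complete_linkage(D, a, b), average_linkage(D, a, b))
```

The same pair of clusters can therefore look near or far depending on the criterion, which in turn changes the order in which merges or splits happen.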


Related MOOCs (2)

Selected chapters form winterschool on multi-scale brain

Understanding the brain requires an integrated understanding of different scales of organisation of the brain. This Massive Open Online Course (MOOC) will take you through the latest data, models

Related concepts (18)

Cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

Scikit-learn

scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project. The scikit-learn project started as scikits.

Related courses (39)

PHYS-467: Machine learning for physicists

Machine learning and data analysis are becoming increasingly central in sciences including physics. In this course, fundamental principles and methods of machine learning will be introduced and practi

CS-401: Applied data analysis

This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the dat

PHYS-512: Statistical physics of computation

This course covers the statistical physics approach to computer science problems ranging from graph theory and constraint satisfaction to inference and machine learning. In particular the replica and

Related lectures (273)

Supervised Learning: k-NN and Decision Trees

Introduces supervised learning with k-NN and decision trees, covering techniques, examples, and ensemble methods.

Predicting Rainfall: Miniproject BIO-322

Introduces a miniproject where students predict rainfall in Pully using machine learning, focusing on reproducibility and code quality.

Statistical Physics of Clusters

Explores the statistical physics of clusters, focusing on complexity and equilibrium behavior.

Related publications (10)

Musical grammar describes a set of principles that are used to understand and interpret the structure of a piece according to a musical style. The main topic of this study is grammar induction for har

Flow-based generative models have become an important class of unsupervised learning approaches. In this work, we incorporate the key ideas of renormalization group (RG) and sparse prior distribution

Alcherio Martinoli, Chiara Ercolani, Lixuan Tang, Ankita Arun Humne

Chemical gas dispersion can represent a severe threat to human and animal lives, as well as to the environment. Constructing a map of the distribution of gas in a fast and reliable manner is critical

2022