Speaker diarisation

Speaker diarisation (or diarization) is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question "who spoke when?" Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics. With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarisation has received much attention by the speech community, as is manifested by the specific evaluations devoted to it under the auspices of the National Institute of Standards and Technology for telephone speech, broadcast news and meetings. In speaker diarisation, one of the most popular methods is to use a Gaussian mixture model to model each of the speakers, and assign the corresponding frames for each speaker with the help of a Hidden Markov Model. There are two main kinds of clustering scenario. The first one is by far the most popular and is called Bottom-Up. The algorithm starts in splitting the full audio content in a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The second clustering strategy is called top-down and starts with one single cluster for all the audio data and tries to split it iteratively until reaching a number of clusters equal to the number of speakers. A 2010 review can be found at . More recently, speaker diarisation is performed via neural networks leveraging large-scale GPU computing and methodological developments in deep learning.

À propos de ce résultat

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Publications associées (5)

Personnes associées (1)

Cours associés (2)

While public speech resources become increasingly available, there is a growing interest to preserve the privacy of the speakers, through methods that anonymize the speaker information from speech whi

ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD2022

Speech signal conveys several kinds of information such as a message, speaker identity, emotional state of the speaker and social state of the speaker. Automatic speech assessment is a broad area that

EPFL2021

Robust audio segmentation

Jitendra Ajmera

Audio segmentation, in general, is the task of segmenting a continuous audio stream in terms of acoustically homogenous regions, where the rule of homogeneity depends on the task. This thesis aims at

EPFL2005

EE-554: Automatic speech processing

The goal of this course is to provide the students with the main formalisms, models and algorithms required for the implementation of advanced speech processing applications (involving, among others,

EE-623: Perception and learning from multimodal sensors

The course will cover different aspects of multimodal processing (complementarity vs redundancy; alignment and synchrony; fusion), with an emphasis on the analysis of people, behaviors and interaction