Publication

Joint speech and speaker recognition

2005
EPFL thesis
Abstract

This thesis investigates approaches that combine and integrate Automatic Speech Recognition (ASR) and Speaker Recognition (SR) systems, with applications to (1) User-Customized Password Speaker Verification (UCP-SV) systems and (2) joint speech and speaker recognition. Unlike text-dependent speaker verification systems, UCP-SV systems let customers easily choose their own password, which must be pronounced a few times during enrollment to create a customer-specific model that is subsequently used for verification. The main assumption in such systems is that no a priori knowledge about the password (such as its phonetic transcription) is available. Although more user-friendly and more secure, UCP-SV systems are less well understood and pose several new challenges, including automatic inference of the Hidden Markov Model (HMM) of the password (using a speaker-independent ASR system), fast speaker adaptation of the resulting acoustic models, score normalization, and verification of both lexical and speaker characteristics. Such systems are therefore developed and evaluated on their ability to jointly verify (1) the identity of the claimed speaker and (2) that the correct password was pronounced, thus rejecting all other possible alternatives. Two speaker acoustic modeling approaches are investigated in this thesis: an HMM/GMM approach (based on Gaussian Mixture Models, GMMs) and a hybrid HMM/MLP approach (based on Multi-Layer Perceptrons, MLPs). In the HMM/GMM approach, the background model used for likelihood normalization was the main difficulty, and several solutions were investigated to improve the baseline system. In the HMM/MLP approach, MLP adaptation was also a problem: we found that the adapted MLP tended to learn the lexical content of the password rather than the customer's voice characteristics.
Therefore, a probabilistic framework that combines the hybrid HMM/MLP system and a GMM is proposed and extensively investigated. In this framework, the HMM/MLP system is used for utterance verification, while the GMM is used for speaker verification. Since UCP-SV involves both speech recognition (ASR) and speaker verification (SV), a natural extension of this work was to investigate new ways of using ASR together with Speaker Recognition (SR) to improve both systems. We show that optimization and recognition based on a joint ASR-SR posterior probability criterion yield better ASR and SR performance than the two systems achieve independently, and better than a "sequential" approach (e.g., first performing speaker identification/clustering, followed by speech recognition). This work resulted in a PC-based, real-time implementation of an HMM-based UCP-SV system, available for demonstration.
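
As a rough illustration of the likelihood normalization and the two-part verification described in the abstract, the following sketch scores a set of feature frames against a customer GMM and a background GMM and combines the resulting log-likelihood ratio with an utterance verification score. It is not the thesis implementation: the function names, the diagonal-covariance GMM, the thresholds, and the simple "both tests must pass" rule are assumptions made for illustration, and `utterance_score` stands in for whatever confidence the HMM/MLP system would produce.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (K,); means, variances: (K, D).
    """
    frames = np.atleast_2d(frames)
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                      # (T, K)
    log_weighted = np.log(weights) + log_gauss
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(per_frame.mean())


def speaker_llr(frames, customer_gmm, background_gmm):
    """Log-likelihood ratio: customer model normalized by the background model."""
    return (gmm_log_likelihood(frames, *customer_gmm)
            - gmm_log_likelihood(frames, *background_gmm))


def accept(frames, customer_gmm, background_gmm, utterance_score,
           llr_threshold=0.0, utterance_threshold=0.5):
    """Hypothetical decision rule: accept only if both checks pass.

    utterance_score is an assumed confidence in [0, 1] that the correct
    password was spoken; both thresholds are illustrative values.
    """
    speaker_ok = speaker_llr(frames, customer_gmm, background_gmm) > llr_threshold
    password_ok = utterance_score > utterance_threshold
    return bool(speaker_ok and password_ok)
```

With a one-component customer model centred at 1.0 and a background model centred at 0.0 (unit variance), a frame at 1.0 gives a log-likelihood ratio of 0.5, so the claim is accepted only when the utterance score also exceeds its threshold. A probabilistic combination of the two scores, as proposed in the thesis, would replace the hard AND rule used here.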
