Publication

HMM inference towards flexible speech recognition

2003
Report or working paper
Abstract

One of the difficulties in automatic speech recognition (ASR) is pronunciation variability. Each word (modeled by a baseline phonetic transcription in the ASR dictionary) can be pronounced in many different ways, depending on complex qualitative and quantitative factors such as the speaker's dialect, gender and age, and differences in vocal tract length between speakers. This project focuses on pronunciation modelling in order to better capture this variability. The basic idea, based on Hidden Markov Model (HMM) inference, is to relax the lexical constraint. For each word of the dictionary, we transform the baseline phonetic transcription into an equivalent constrained ergodic HMM. This constrained model is then iteratively relaxed so that it converges towards a truly ergodic HMM, capable of generating any phone sequence. At each relaxation step, a pronunciation model (or several pronunciation models, if the HMM inference is tested on several utterances of the word) is inferred by the Viterbi algorithm. The performance of the inferred model is then measured by a confidence measure (showing how well the inferred model matches the acoustic data) and by a Levenshtein distance (showing how far the inferred model diverges from the baseline phonetic transcription). The method is tested on a list of 75 English words from the PhoneBook database. We observe that, for many of them, the baseline phonetic transcription is a good pronunciation model, since it is stable across many relaxations. This means that such a baseform is robust to pronunciation variability. We also observe that we can infer a new pronunciation model that is close to the baseform in terms of phone sequence and also stable when the constrained ergodic HMM is relaxed. In this case, the solution is to include this inferred model ( the baseform model) in the dictionary.
For a few words, the baseform may not be suitable for many speakers (low matching with the acoustic data and high divergence). Finally, the project is carried out in the context of a hybrid HMM/ANN recognizer (using Artificial Neural Networks (ANN) to estimate local posterior probabilities). Using the HMM inference technique, we also compare two ASR systems: a baseline system (trained with standard features) and a pitch-based system. We observe that the pitch-based multilayer perceptron (MLP) improves not only the matching between the acoustic data and the pronunciation model but also the stability of the baseform pronunciation model.
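
The relax-decode-measure loop described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the abstract does not say how the constrained ergodic HMM is parameterized, so the single `relax` weight (0 = only baseform paths, 1 = fully ergodic), the one-state-per-phone topology, the use of raw posteriors as emission scores, the mean log posterior as a confidence measure, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two sequences (e.g. lists of phone symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (x != y)))    # substitution or match
        prev = cur
    return prev[-1]

def relaxed_transitions(phones, baseform, relax):
    """Hypothetical relaxation: mix baseform-only transitions with uniform ones."""
    n = len(phones)
    idx = {p: i for i, p in enumerate(phones)}
    base = np.zeros((n, n))
    for p in baseform:
        base[idx[p], idx[p]] = 1.0                     # self-loop (phone duration)
    for p, q in zip(baseform, baseform[1:]):
        base[idx[p], idx[q]] = 1.0                     # baseform successor
    for i in range(n):
        if base[i].sum() == 0:                         # phone absent from baseform
            base[i, i] = 1.0
    base /= base.sum(axis=1, keepdims=True)
    return (1 - relax) * base + relax * np.full((n, n), 1.0 / n)

def viterbi(log_init, log_trans, log_emit):
    """Best state path; log_emit has shape (frames, states)."""
    n_frames, n_states = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans            # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def infer_pronunciation(phones, baseform, posteriors, relax):
    """Decode a phone sequence and a simple confidence score for one utterance."""
    n = len(phones)
    init = np.zeros(n)
    init[phones.index(baseform[0])] = 1.0
    init = (1 - relax) * init + relax / n
    trans = relaxed_transitions(phones, baseform, relax)
    with np.errstate(divide="ignore"):                 # log(0) = -inf is intended
        path = viterbi(np.log(init), np.log(trans), np.log(posteriors))
    seq = [phones[s] for k, s in enumerate(path) if k == 0 or s != path[k - 1]]
    # Assumed confidence measure: mean log posterior along the decoded path.
    confidence = float(np.mean(np.log(posteriors[np.arange(len(path)), path])))
    return seq, confidence

# Toy data (invented): the dictionary says "k ae t", the frames sound like "k ae p".
PHONES = ["k", "ae", "t", "p"]
BASEFORM = ["k", "ae", "t"]
POSTERIORS = np.array([[0.85, 0.05, 0.05, 0.05],
                       [0.85, 0.05, 0.05, 0.05],
                       [0.05, 0.85, 0.05, 0.05],
                       [0.05, 0.05, 0.05, 0.85],
                       [0.05, 0.05, 0.05, 0.85]])

for relax in (0.0, 0.5, 1.0):
    seq, conf = infer_pronunciation(PHONES, BASEFORM, POSTERIORS, relax)
    print(relax, seq, round(conf, 3), levenshtein(seq, BASEFORM))
```

In this toy setting, a baseform that survives increasing values of `relax` unchanged corresponds to the "stable" case described in the abstract, while a rising Levenshtein distance together with a higher confidence score signals that the acoustic data favour a different phone sequence.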

