HMM inference towards flexible speech recognition

2003
Report or working paper

Abstract

One of the difficulties in automatic speech recognition (ASR) is pronunciation variability. Each word (modeled by a baseline phonetic transcription in the ASR dictionary) can be pronounced in many different ways, depending on complex qualitative and quantitative factors such as the speaker's dialect, gender, age, and vocal tract length. This project focuses on pronunciation modelling in order to better capture this variability. The basic idea, based on Hidden Markov Model (HMM) inference, is to relax the lexical constraint. For each word in the dictionary, we transform the baseline phonetic transcription into an equivalent constrained ergodic HMM. This constrained model is then iteratively relaxed so that it converges towards a truly ergodic HMM, capable of generating any phone sequence. At each relaxation step, a pronunciation model (or several pronunciation models, if the HMM inference is tested on several utterances of the word) is inferred by the Viterbi algorithm. The performance of each inferred model is then measured by a confidence measure (showing how well the inferred model matches the acoustic data) and by a Levenshtein distance (showing how far the inferred model diverges from the baseline phonetic transcription). The method is tested on a list of 75 English words from the PhoneBook database. We observe that, for many of them, the baseline phonetic transcription is a good pronunciation model, since it remains stable across many relaxations. This means that such a baseform is robust to pronunciation variability. We also observe that we can infer a new pronunciation model that is close to the baseform in terms of phone sequence and that is also stable when the constrained ergodic HMM is relaxed. In this case, the solution is to include this inferred model (alongside the baseform model) in the dictionary.
For a few words, the baseform may not be suitable for many speakers (poor match with the acoustic data and high divergence). Finally, the project is carried out in the context of a hybrid HMM/ANN recognizer (using Artificial Neural Networks (ANN) to estimate local posterior probabilities). Additionally, we use the HMM inference technique to compare two ASR systems, namely a baseline system (trained with standard features) and a pitch-based system. We observe that the pitch-based multilayer perceptron (MLP) improves not only the match between the acoustic data and the pronunciation model but also the stability of the baseform pronunciation model.
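
The relax, decode, and compare loop described in the abstract can be illustrated with a small Python sketch. It is not the report's implementation. It assumes one HMM state per phone, a linear interpolation between the baseform-constrained transitions and a uniform (fully ergodic) matrix controlled by a hypothetical parameter `eps`, and frame posteriors used directly as emission scores. All function names are illustrative.

```python
import numpy as np

def relaxed_transitions(baseform, phones, eps):
    """Transition matrix: eps=0 follows the baseform only, eps=1 is fully ergodic.
    Assumption: linear interpolation; the report's relaxation schedule may differ."""
    k = len(phones)
    idx = {p: i for i, p in enumerate(phones)}
    base = np.zeros((k, k))
    for a, b in zip(baseform, baseform[1:]):
        base[idx[a], idx[a]] = 1.0  # self-loop (phone duration)
        base[idx[a], idx[b]] = 1.0  # step to the next baseform phone
    base[idx[baseform[-1]], idx[baseform[-1]]] = 1.0
    base[base.sum(axis=1) == 0] = 1.0  # phones outside the baseform: uniform row
    base /= base.sum(axis=1, keepdims=True)
    return (1.0 - eps) * base + eps * np.full((k, k), 1.0 / k)

def viterbi(log_emis, log_trans, log_init):
    """Most likely state path for a (frames x states) matrix of log emission scores."""
    n_frames, k = log_emis.shape
    delta = log_init + log_emis[0]
    back = np.zeros((n_frames, k), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans  # scores[i, j]: arrive in j from i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def infer_pronunciation(posteriors, baseform, phones, eps):
    """Decode with the relaxed HMM and collapse repeated states into a phone sequence."""
    k = len(phones)
    init = np.full(k, eps / k)
    init[phones.index(baseform[0])] += 1.0 - eps
    with np.errstate(divide="ignore"):  # log(0) = -inf encodes forbidden moves
        path = viterbi(np.log(posteriors),
                       np.log(relaxed_transitions(baseform, phones, eps)),
                       np.log(init))
    seq = [phones[path[0]]]
    for s in path[1:]:
        if phones[s] != seq[-1]:
            seq.append(phones[s])
    return seq

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Toy data (made up): the last two frames favor "d" over the baseform's "t".
phones = ["k", "ae", "t", "d"]
baseform = ["k", "ae", "t"]
posteriors = np.array([[0.85, 0.05, 0.05, 0.05]] * 2
                      + [[0.05, 0.85, 0.05, 0.05]] * 2
                      + [[0.05, 0.05, 0.10, 0.80]] * 2)
for eps in (0.0, 0.5, 1.0):
    inferred = infer_pronunciation(posteriors, baseform, phones, eps)
    print(eps, inferred, levenshtein(baseform, inferred))
```

In this toy setting, the fully constrained model reproduces the baseform, while the fully relaxed model follows the acoustic evidence and diverges by one substitution. A baseform that stays at distance 0 over a wide range of `eps` would correspond to what the abstract calls a stable pronunciation model.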

Official source

https://infoscience.epfl.ch/record/82874?ln=en

Novel Methods For Detection And Analysis Of Atypical Aspects In Speech

Robust Outlier Rejection for 3D Registration with Variational Bayes

On quantifying the quality of acoustic models in hybrid DNN-HMM ASR