
# On quantifying the quality of acoustic models in hybrid DNN-HMM ASR

Abstract

We propose an information theoretic framework for quantitative assessment of acoustic models used in hidden Markov model (HMM) based automatic speech recognition (ASR). The HMM backend expects that (i) the acoustic model yields accurate state conditional emission probabilities for the observations at each time step, and (ii) the conditional probability distribution of the data given the underlying hidden state is independent of any other state in the sequence. The latter property is also known as the Markovian conditional independence assumption of HMM-based modeling. In this work, we cast HMM-based ASR as a communication channel in which the acoustic model computes the state emission probabilities as the input of the channel and the channel outputs the most probable hidden state sequence. The quality of the acoustic model is thus quantified in terms of the amount of information transmitted through this channel as well as how robust this channel is against the mismatch between the data and the HMM's conditional independence assumption. To formulate the required information theoretic terms, we utilize the gamma posterior (or state occupancy) probabilities of HMM hidden states to derive a simple and straightforward analysis framework which assesses the benefits and shortcomings of various acoustic models in HMM-based ASR. Our approach enables us to analyze acoustic modeling with Gaussian mixture models (GMM) as well as deep neural networks (DNN) (with different numbers of hidden layers) without actually evaluating their ASR performance explicitly. As use cases, we apply our analysis to sequence-discriminatively trained DNN acoustic models as well as state-of-the-art recurrent and time-delay neural networks to compare their efficacy as acoustic models in HMM-based ASR. In addition, we use our analysis to study the contribution of sparse and low-dimensional models in enhancing acoustic modeling for better compliance with the HMM requirements.
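The gamma (state-occupancy) posteriors at the heart of this analysis come from a standard forward-backward pass over the HMM. A minimal numpy sketch of that computation, assuming toy emission scores in place of real acoustic-model outputs (the function name and example values are illustrative, not the authors' implementation):

```python
import numpy as np

def gamma_posteriors(emis, trans, init):
    """State-occupancy (gamma) posteriors of an HMM via forward-backward.

    emis:  (T, K) state-conditional emission likelihoods per frame,
           e.g. scaled DNN posteriors in a hybrid system.
    trans: (K, K) transition matrix, rows summing to 1.
    init:  (K,) initial state distribution.
    Returns (T, K) where gamma[t, k] = P(state_t = k | full observation sequence).
    """
    T, K = emis.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    # forward pass, normalized per frame for numerical stability
    alpha[0] = init * emis[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emis[t]
        alpha[t] /= alpha[t].sum()
    # backward pass, same normalization
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# toy 2-state example: emissions favor state 0 early, state 1 late
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
init = np.array([0.5, 0.5])
emis = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]])
gamma = gamma_posteriors(emis, trans, init)
```

Each row of `gamma` is a distribution over hidden states given the whole sequence, which is what makes it a natural quantity for information-theoretic measurements of the channel described above.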

Official source



Related concepts (30)

Automatic speech recognition

Speech recognition is usually handled in middleware; the results are passed on to the consuming applications. Automatic speech recognition…

Hidden Markov model

A hidden Markov model (HMM; term and definition standardized by ISO/IEC in [ISO/IEC 2382-29:1999]), or more accurately (though the term is not used) a hidden-state Markov automaton, is a model…

Deep learning

Deep learning (also called deep structured learning or hierarchical learning) is a subfield of artificial intelligence…

Related publications (103)


Afsaneh Asaei, Hervé Bourlard, Pranay Dighe

We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection to the space of training data. Relying on the fact that the intrinsic dimensions of the posterior subspaces are indeed very small and the matrix of all posteriors belonging to a class has a very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify the spurious errors due to mismatch conditions. The enhanced acoustic modeling method leads to improvements in a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions, where up to 15.4% relative reduction in word error rate (WER) is achieved.
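The sparse-coding step described above can be sketched with a hand-rolled orthogonal matching pursuit, using training posteriors directly as dictionary atoms (a simplification of learned dictionaries; the function names and the tiny 4-class example are illustrative assumptions, not the authors' code):

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Orthogonal matching pursuit: sparse code of x over dictionary D
    (columns = unit-L2-norm atoms)."""
    residual = x.astype(float).copy()
    support = []
    code = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # refit coefficients jointly on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code[support] = coef
    return code

def enhance_posterior(D, p, n_nonzero=2):
    """Reconstruct a noisy test posterior from a few training-posterior
    atoms, then renormalize back onto the probability simplex."""
    recon = D @ omp(D, p, n_nonzero)
    recon = np.clip(recon, 0.0, None)
    return recon / recon.sum()

# dictionary: columns are unit-normalized training posteriors (toy values)
atoms = np.array([[0.90, 0.05, 0.03, 0.02],
                  [0.04, 0.88, 0.05, 0.03],
                  [0.02, 0.04, 0.90, 0.04]]).T
D = atoms / np.linalg.norm(atoms, axis=0)
noisy = np.array([0.70, 0.12, 0.10, 0.08])  # degraded class-0 posterior
clean = enhance_posterior(D, noisy)
```

Because the reconstruction lives in the span of a few training atoms, unstructured noise that falls outside those subspaces is suppressed, which is the mechanism the abstract attributes the WER gains to.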

Afsaneh Asaei, Hervé Bourlard, Pranay Dighe

We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection to the space of training data. Relying on the fact that the intrinsic dimensions of the posterior subspaces are indeed very small and the matrix of all posteriors belonging to a class has a very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify the spurious errors due to mismatch conditions. The enhanced acoustic modeling method leads to improvements in a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions.

Afsaneh Asaei, Hervé Bourlard, Pranay Dighe

Conventional deep neural networks (DNN) for speech acoustic modeling rely on Gaussian mixture models (GMM) and hidden Markov models (HMM) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states or senones. The present work addresses some limitations of GMM-HMM senone alignments for DNN training. We hypothesize that the senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high-dimensional unstructured noise, whereas the informative components are structured and low-dimensional. We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data. Experiments conducted on the AMI corpus show a 4.6% relative reduction in word error rate.
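A minimal sketch of the low-rank reconstruction idea: stack the posteriors of one senone class as rows, keep only the top singular components, and renormalize the rows into soft targets. The rank, the toy template, and the noise level below are assumptions for illustration, not the paper's experimental setup:

```python
import numpy as np

def lowrank_soft_targets(P, rank):
    """Denoise a matrix of senone posteriors (rows = frames of one class)
    by truncated SVD, then renormalize rows into valid soft targets."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    P_hat = np.clip(P_hat, 0.0, None)  # posteriors must stay non-negative
    return P_hat / P_hat.sum(axis=1, keepdims=True)

# toy data: 50 frames of one senone class, each a perturbed copy of a template
rng = np.random.default_rng(0)
template = np.array([0.85, 0.06, 0.05, 0.04])
P = np.clip(template + 0.05 * rng.standard_normal((50, 4)), 1e-6, None)
P = P / P.sum(axis=1, keepdims=True)
targets = lowrank_soft_targets(P, rank=1)
```

The rank-1 reconstruction pulls every frame toward the shared class structure, discarding the high-dimensional unstructured noise the abstract refers to; the cleaned rows can then replace binary labels as DNN training targets.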