Multilingual Training and Adaptation in Speech Recognition

Sibo Tong
2020
EPFL thesis

Abstract

State-of-the-art acoustic models for Automatic Speech Recognition (ASR) are based on Hidden Markov Models (HMM) and Deep Neural Networks (DNN) and often require thousands of hours of transcribed speech data during training. Therefore, building multilingual ASR systems or systems for a language with few resources is a challenging task. Multilingual training and cross-lingual adaptation are potential solutions. However, context-dependent state modeling creates difficulties for multilingual and cross-lingual ASR because of the large increase in context-dependent labels arising from the phone set mismatch.

The goal of this thesis is to improve current state-of-the-art acoustic modeling techniques in general for ASR, with a particular focus on multilingual ASR and cross-lingual adaptation. We systematically exploited new training frameworks, ranging from Maximum Likelihood Estimation and Connectionist Temporal Classification to Maximum Mutual Information, in the context of phoneme-based multilingual training. To minimize the negative effects of data impurity arising from language mismatch, we investigated language adaptive training approaches, which further improve multilingual ASR performance. Through comprehensive experimental comparison, we demonstrated that phoneme-based multilingual models are easily extensible to unseen phonemes of new languages, and that cross-lingual adaptation from these models yields significant improvement over traditional approaches on limited data. Finally, we proposed a semi-supervised training approach based on dropout to boost performance in low-resource languages using untranscribed data.

In the other part of the thesis, we conducted more theoretical analysis of techniques found to be useful in sequential multilingual training. More specifically, we revisited the recurrent architecture based on Bayes's theorem. This leads to a Bayesian recurrent unit that is dictated by the probabilistic formulation and naturally supports a backward recursion. Experiments show that the proposed architecture exceeds the performance of conventional recurrent networks.

Together, this thesis constitutes a thorough analysis of the current field. Through theoretical and experimental comparisons, the proposed approaches are shown to yield significant improvement over the conventional hybrid systems on multilingual speech recognition.

Official source

https://infoscience.epfl.ch/record/278715?ln=en
