The advent of statistical parametric speech synthesis has paved the way to a unified framework for hidden Markov model (HMM) based text-to-speech synthesis (TTS) and automatic speech recognition (ASR), so that techniques and advances from ASR can now be adopted in synthesis. Speaker adaptation is a well-developed topic in ASR, where adaptation data from a target speaker is used to transform the canonical model parameters into a speaker-specific model. Feature adaptation techniques such as vocal tract length normalization (VTLN) perform the same task by transforming the features; this can be shown to be equivalent to model transformation. The main advantage of VTLN is that it yields noticeable performance improvements with very little adaptation data and can therefore be classified as a rapid adaptation technique. VTLN is widely used in ASR and can also be used in TTS to improve rapid adaptation performance. In TTS, the task is to synthesize speech that sounds like a particular target speaker, and VTLN is found to make the synthesized output sound quite similar to the target speaker from the very first utterance.

An all-pass filter based bilinear transform was implemented for the mel-generalized cepstral (MGCEP) features of the HMM-based speech synthesis system (HTS). The initial implementation used a grid-search approach that selects the best warping factor for the speech spectrum from a grid of candidate values under the maximum likelihood criterion. VTLN was shown to give performance improvements in the rapid adaptation framework, where the number of adaptation sentences from the target speaker is limited. However, this technique incurs high time and space complexity, and rapid adaptation demands an efficient implementation of VTLN. To this end, an efficient expectation maximization (EM) based VTLN approach was implemented for HTS using Brent's search. Unlike ASR features, MGCEP does not use a filter bank (in order to facilitate speech reconstruction), which provides the equivalence to model transformation needed for the EM implementation. This allows the warping factors to be estimated within HMM training using the same sufficient statistics as constrained maximum likelihood linear regression (CMLLR).

This work addresses several challenges that arise when adopting VTLN for synthesis, owing to the higher dimensionality of the cepstral features used in TTS models. The main idea was to unify theory and practice in the implementation of VTLN for both ASR and TTS. Several techniques are proposed in this thesis to find the best feasible warping factor estimation procedure; estimating the warping factor from the lower-order cepstral features representing the spectral envelope is shown to be the best approach. Evaluations on standard databases illustrate the performance improvements and the perceptual challenges involved in VTLN adaptation. VTLN has only a single parameter to represent the speaker characteristics and hence does not scale to the performance of other linear transform based adaptation methods when large amounts of adaptation data are available.
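As a rough illustration of the grid-search procedure described above, the following Python sketch warps cepstral feature vectors with a first-order all-pass (bilinear) transform and picks the warping factor that maximizes a simple diagonal-Gaussian log-likelihood. The warping recursion follows the standard frequency-warping algorithm used with mel-cepstral features; the Gaussian model, the grid of candidate values and the toy data are placeholders rather than the actual HTS implementation described in the thesis.

    import numpy as np

    def warp_cepstrum(c, alpha, order_out=None):
        # Frequency-warp one cepstral vector with a first-order all-pass
        # (bilinear) transform, using the standard freqt-style recursion.
        if order_out is None:
            order_out = len(c) - 1
        beta = 1.0 - alpha * alpha
        g = np.zeros(order_out + 1)
        for coeff in c[::-1]:          # feed c[M], ..., c[0] through the recursion
            d = g.copy()               # state from the previous input coefficient
            g[0] = coeff + alpha * d[0]
            if order_out >= 1:
                g[1] = beta * d[0] + alpha * d[1]
            for j in range(2, order_out + 1):
                g[j] = d[j - 1] + alpha * (d[j] - g[j - 1])
        return g

    def select_warping_factor(frames, mean, var, grid=np.linspace(-0.1, 0.1, 21)):
        # Grid search: warp every frame for each candidate alpha and keep the
        # value maximizing a diagonal-Gaussian log-likelihood (a stand-in for
        # the HMM state likelihoods used in the real system).
        def loglik(alpha):
            warped = np.array([warp_cepstrum(f, alpha) for f in frames])
            return -0.5 * np.sum((warped - mean) ** 2 / var + np.log(2 * np.pi * var))
        scores = [loglik(a) for a in grid]
        return grid[int(np.argmax(scores))]

    # Toy usage with random "adaptation data" and a matching Gaussian model.
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((50, 25))       # 50 frames of 24th-order cepstra
    mean, var = frames.mean(0), frames.var(0) + 1e-3
    print(select_warping_factor(frames, mean, var))

The EM-based approach developed in the thesis replaces this exhaustive search with Brent's method over the likelihood objective (for instance via scipy.optimize.minimize_scalar with method='brent'), so that the warping factor can be re-estimated efficiently inside HMM training.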
Several techniques are demonstrated in this work to combine model-based adaptation, such as constrained structural maximum a posteriori linear regression (CSMAPLR), with VTLN; one such technique uses VTLN as the prior transform at the root node of the tree structure of the CSMAPLR system. Thus, along with rapid adaptation, performance scales with the availability of more adaptation data. Although developed for TTS, these techniques can also be used effectively in ASR, where they were shown to give improvements, especially in scenarios such as noisy speech conditions. Other improvements to rapid adaptation are also proposed, including a bias term for VTLN, multiple-transform VTLN using regression classes, and a VTLN prior for non-structural MAPLR adaptation; these also demonstrated performance improvements in both ASR and TTS. In addition, evaluations on a few special scenarios, specifically cross-lingual speech, cross-gender speech, child speech and noisy speech, are presented, in which the rapid adaptation methods developed in this work were shown to be highly beneficial. Most of these methods will be published as extensions to the open-source HTS toolkit.
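To give a concrete, if simplified, picture of how VTLN can act as a prior for tree-structured adaptation, the sketch below places a VTLN-derived affine transform at the root of a hypothetical regression-class tree and lets each node interpolate its own maximum-likelihood transform with its parent's transform, so that nodes with little adaptation data back off towards the VTLN prior. The Node class, the interpolation weight tau and the toy transforms are illustrative assumptions; the actual CSMAPLR estimation is considerably more involved.

    import numpy as np

    class Node:
        # One node of a hypothetical regression-class tree.  W_ml is the node's
        # maximum-likelihood affine transform [A | b] (assumed precomputed) and
        # count is the amount of adaptation data assigned to the node.
        def __init__(self, W_ml=None, count=0.0, children=()):
            self.W_ml, self.count, self.children = W_ml, count, list(children)
            self.W = None

    def propagate_map(node, W_parent, tau=100.0):
        # MAP-style smoothing down the tree: each node interpolates its ML
        # transform with its parent's transform, weighted by its data count.
        # The root's "parent" is the VTLN transform, so data-sparse nodes back
        # off towards VTLN (a simplification of the actual CSMAPLR update).
        if node.W_ml is None or node.count == 0:
            node.W = W_parent
        else:
            w = node.count / (node.count + tau)
            node.W = w * node.W_ml + (1.0 - w) * W_parent
        for child in node.children:
            propagate_map(child, node.W, tau)

    # VTLN as the prior at the root: an affine transform whose linear part is a
    # bilinear-transform warping matrix (identity here as a placeholder).
    dim = 25
    W_vtln = np.hstack([np.eye(dim), np.zeros((dim, 1))])

    leaf = Node(W_ml=np.hstack([1.1 * np.eye(dim), np.ones((dim, 1))]), count=30.0)
    root = Node(children=[leaf])
    propagate_map(root, W_vtln)
    print(np.round(leaf.W[:2, :3], 3))

With no adaptation data at a node, the result is exactly the VTLN transform; as the data count grows, the node's transform moves towards its maximum-likelihood estimate, which is the behaviour that lets the combined scheme both adapt rapidly and scale with more data.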