This report presents one month of trainee work on the development of a French automatic speech recognition (ASR) system using GlobalPhone_FR, the French part of a multilingual database. Its purpose is to explain the training and testing of the ASR system on this database and to report the results. Two methods are presented: a hidden Markov model (HMM) with MFCC/PLP features, and tandem features derived from multilayer perceptron (MLP) phone posteriors. The report describes the data preparation for GlobalPhone_FR ASR training and compares the two approaches. Word recognition accuracy is 71.46% with MFCC features, and tandem features from a 3-layer MLP improve it to 72.15%. We interpret this result as a baseline for the GlobalPhone_FR database.
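As an illustration of the tandem idea mentioned above, the following is a minimal sketch, not the report's actual pipeline. It assumes the common recipe of taking the logarithm of the MLP phone posteriors and decorrelating them with PCA before passing them to the HMM. The function name `tandem_features`, the optional `base` argument (for appending the result to MFCC frames) and the `n_components` parameter are hypothetical. The abstract does not say whether the posteriors were concatenated with MFCCs or how many dimensions were kept.

```python
import numpy as np

def tandem_features(posteriors, n_components=None, base=None, eps=1e-10):
    """Turn MLP phone posteriors into tandem-style features (illustrative sketch).

    posteriors: (T, K) array, one posterior distribution per frame.
    n_components: number of PCA dimensions to keep (None keeps all).
    base: optional (T, D) array of acoustic features (e.g. MFCC) to prepend.
    """
    # Log compresses the skewed posterior distribution.
    log_post = np.log(np.asarray(posteriors, dtype=float) + eps)
    centered = log_post - log_post.mean(axis=0)

    # PCA via SVD: rows of vt are the principal directions.
    # Assumption: a real system would estimate this on training data only.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if n_components is not None:
        vt = vt[:n_components]
    projected = centered @ vt.T

    if base is not None:
        return np.hstack([np.asarray(base, dtype=float), projected])
    return projected
```

For example, with 13-dimensional MFCC frames and `n_components=3`, the output has 16 columns per frame, and the three tandem columns are mutually uncorrelated, which suits diagonal-covariance Gaussian models.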
Philip Neil Garner, Pierre-Edouard Jean Charles Honnet
"dynamism" will be measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. Indeed, in the case of entropy, it is clear that, on average, the entropy at the output of the local PDF estimators will be larger for speech signals than for non-speech signals presented at their input. In our case, local probabilities will be estimated by a multilayer perceptron (MLP) as used in hybrid HMM/MLP systems, thus guaranteeing the use of "real" probabilities in the estimation of the entropy. The 2-state speech/non-speech HMM will thus take these two-dimensional features (entropy and "dynamism"), whose distributions will be modeled through two-dimensional multi-Gaussian densities or an MLP, with parameters trained through a Viterbi algorithm. Different experiments, covering different speech and music styles as well as different a priori distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), will illustrate the robustness of the approach, which always yields a correct segmentation rate above 90%. Finally, we will show how a confidence measure can be used to further improve the segmentation results, and discuss how this may be used to extend the technique to the case of speech/music mixtures.
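The two frame-level features described above can be sketched as follows. This is a hedged illustration only: the entropy is the standard Shannon entropy of each posterior vector, while the "dynamism" definition used here (squared difference between consecutive posterior vectors) is an assumption, since the excerpt does not give the formula. The function name `entropy_and_dynamism` is hypothetical, and the HMM stage with minimum duration constraints is not shown.

```python
import numpy as np

def entropy_and_dynamism(posteriors, eps=1e-10):
    """Compute per-frame [entropy, dynamism] from MLP posteriors (sketch).

    posteriors: (T, K) array, one posterior distribution per frame.
    Returns a (T, 2) array that a 2-state HMM could take as observations.
    """
    p = np.asarray(posteriors, dtype=float)

    # Shannon entropy of each frame's posterior distribution.
    entropy = -np.sum(p * np.log(p + eps), axis=1)

    # Assumed dynamism: squared change between consecutive posterior vectors.
    change = np.sum(np.diff(p, axis=0) ** 2, axis=1)
    # The first frame has no predecessor, so its dynamism is set to zero.
    dynamism = np.concatenate([[0.0], change])

    return np.column_stack([entropy, dynamism])
```

A constant, uniform posterior stream gives maximal entropy and zero dynamism, while a stream that jumps between confident one-hot outputs gives near-zero entropy and high dynamism.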