Automatic speech recognition (ASR) systems, by using the phoneme as an intermediate unit of representation, split the problem of modeling the relationship between the written form (i.e., the text) and the acoustic speech signal into two disjoint processes. The first process models the relationship between the written form and phonemes by developing a pronunciation lexicon using prior knowledge about grapheme-to-phoneme relationships. Given the pronunciation lexicon and the transcribed speech data, the second process models the relationship between the phonemes and the acoustic speech signal using statistical sequence processing techniques, such as hidden Markov models. As a consequence of these two disjoint processes, development of an ASR system relies heavily on the availability of well-developed acoustic and lexical resources in the target language. This paper presents an approach in which the relationship between graphemes and phonemes is learned through acoustic data, more precisely, through phoneme posterior probabilities estimated from the speech signal. In doing so, the approach tightly couples the two above-mentioned processes and leads to a framework in which existing acoustic and lexical resources from different domains and languages can be effectively exploited to build ASR systems without developing a pronunciation lexicon, and to develop lexical resources for resource-scarce domains and languages. We demonstrate these capabilities of the proposed approach through cross-domain studies in English, where the grapheme-to-phoneme relationship is deep.
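As a rough illustration of what learning a grapheme-to-phoneme relationship from phoneme posterior probabilities could look like, the toy Python sketch below averages frame-level phoneme posteriors over the frames assigned to each grapheme. It is not the method of the paper: the function name `estimate_grapheme_phoneme_probs`, the three-phoneme inventory, the posterior values, and the frame-to-grapheme alignment are all hypothetical, and the alignment is assumed to be given, whereas a real system would have to infer it.

```python
from collections import defaultdict


def estimate_grapheme_phoneme_probs(posteriors, alignment, phonemes):
    """Average the phoneme posteriors of all frames aligned to each grapheme.

    posteriors: list of per-frame probability vectors over `phonemes`
    alignment:  list of grapheme labels, one per frame (assumed known here)
    phonemes:   list of phoneme symbols, in the order used by `posteriors`
    """
    sums = defaultdict(lambda: [0.0] * len(phonemes))
    counts = defaultdict(int)
    for frame, grapheme in zip(posteriors, alignment):
        for i, p in enumerate(frame):
            sums[grapheme][i] += p
        counts[grapheme] += 1
    # Normalise by the frame count to get one distribution per grapheme.
    return {
        g: dict(zip(phonemes, [s / counts[g] for s in sums[g]]))
        for g in sums
    }


# Hypothetical data: three frames of the word "ca(t)", made up for illustration.
phonemes = ["k", "s", "ae"]
posteriors = [
    [0.8, 0.1, 0.1],  # frame 1, aligned to grapheme "c"
    [0.6, 0.2, 0.2],  # frame 2, aligned to grapheme "c"
    [0.1, 0.1, 0.8],  # frame 3, aligned to grapheme "a"
]
alignment = ["c", "c", "a"]

probs = estimate_grapheme_phoneme_probs(posteriors, alignment, phonemes)
# Most probable phoneme for each grapheme under the averaged posteriors.
best = {g: max(dist, key=dist.get) for g, dist in probs.items()}
print(best)
```

In this toy setting each grapheme ends up with a probability distribution over phonemes, so an ambiguous grapheme such as "c" can retain mass on both /k/ and /s/ rather than being forced to a single pronunciation.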
Ramya Rasipuram, Marzieh Razavi