Speech recognition-based applications, built upon advances in artificial intelligence, play an essential role in transforming most aspects of modern life. However, speech recognition in real-life conditions (e.g., in the presence of overlapping speech or varying speaker characteristics) remains a challenge. Current research on robust speech recognition mostly relies on building systems driven by complex deep neural networks. Nonetheless, the speech production process gives rise to low-dimensional subspaces that can carry class-specific information in speech. In this thesis, we investigate the exploitation of this low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR).

This thesis mainly focuses on sparse autoencoders for sparse modeling of speech, starting from their often-overlooked connection with sparse coding. We hypothesize that whenever the speech signal is represented in a high-dimensional feature space, the true class information (regarding the speech content) is embedded in low-dimensional subspaces. The analysis of the high-dimensional sparse speech representations obtained from the sparse autoencoders demonstrates their prominent capability of modeling the underlying (e.g., sub-phonetic) components of speech. When used for recognition, the representations from sparse autoencoders yield performance improvements. Finally, we repurpose the aforementioned sparse autoencoders for the pathological speech recognition task in a transfer learning framework.

In this context, the contribution of this thesis is twofold: (i) in speech modeling, proposing the use of sparse autoencoders as a novel way of sparse modeling for extracting the class-specific low-dimensional subspaces in speech features, and (ii) in speech recognition, demonstrating the effectiveness of these autoencoders in state-of-the-art ASR frameworks towards the goal of improving robust ASR, in particular on far-field speech from the AMI dataset and pathological speech from the UA-Speech dataset.
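To illustrate the generic construction the abstract refers to, the sketch below shows a minimal sparse autoencoder: an encoder-decoder pair trained with a reconstruction loss plus an L1 penalty on the hidden code, which drives most code units to zero so that each input frame is explained by a small subset of units (a low-dimensional subspace). This is a hedged, illustrative example in PyTorch; the layer sizes, activation, and penalty weight are arbitrary placeholders and do not reflect the architecture or hyperparameters used in the thesis.

```python
# Illustrative sparse autoencoder sketch (not the thesis's actual model).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, in_dim=440, code_dim=1024):
        super().__init__()
        # Overcomplete code: more hidden units than input dimensions.
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = torch.relu(self.encoder(x))   # non-negative, high-dimensional code
        recon = self.decoder(code)
        return recon, code

def sparse_ae_loss(x, recon, code, l1_weight=1e-3):
    # Reconstruction error plus an L1 penalty that encourages most code
    # units to be zero for any given frame.
    return nn.functional.mse_loss(recon, x) + l1_weight * code.abs().mean()

# Toy usage on random vectors standing in for acoustic feature frames.
model = SparseAutoencoder()
x = torch.randn(32, 440)
recon, code = model(x)
loss = sparse_ae_loss(x, recon, code)
loss.backward()
```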