Speech recognition-based applications, built upon advances in artificial intelligence, play an essential role in transforming most aspects of modern life. However, speech recognition in real-life conditions (e.g., in the presence of overlapping speech or varying speaker characteristics) remains a challenge. Current research on robust speech recognition mostly relies on building systems driven by complex deep neural networks. Nonetheless, the speech production process gives rise to low-dimensional subspaces that can carry class-specific information in speech. In this thesis, we investigate the exploitation of this low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR).

This thesis mainly focuses on sparse autoencoders for sparse modeling of speech, starting from their often-overlooked connection with sparse coding. We hypothesize that whenever the speech signal is represented in a high-dimensional feature space, the true class information (regarding the speech content) is embedded in low-dimensional subspaces. The analysis of the high-dimensional sparse speech representations obtained from the sparse autoencoders demonstrates their prominent capability of modeling the underlying (e.g., sub-phonetic) components of speech. When used for recognition, the representations from sparse autoencoders yield performance improvements. Finally, we repurpose the aforementioned sparse autoencoders for the pathological speech recognition task in a transfer learning framework.

In this context, the contribution of this thesis is twofold: (i) in speech modeling, proposing the use of sparse autoencoders as a novel way of sparse modeling for extracting the class-specific low-dimensional subspaces in speech features, and (ii) in speech recognition, demonstrating the effectiveness of these autoencoders in state-of-the-art ASR frameworks towards the goal of improving robust ASR, in particular on far-field speech from the AMI dataset and pathological speech from the UA-Speech dataset.
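To illustrate the generic construction the abstract refers to, the sketch below shows a minimal sparse autoencoder: an encoder-decoder pair trained with a reconstruction loss plus an L1 penalty on the hidden code, which drives most code units to zero so that each input frame is explained by a small subset of units (a low-dimensional subspace). This is a hedged, illustrative example in PyTorch; the layer sizes, activation, and penalty weight are arbitrary placeholders and do not reflect the architecture or hyperparameters used in the thesis.

```python
# Illustrative sparse autoencoder sketch (not the thesis's actual model).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, in_dim=440, code_dim=1024):
        super().__init__()
        # Overcomplete code: more hidden units than input dimensions.
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = torch.relu(self.encoder(x))   # non-negative, high-dimensional code
        recon = self.decoder(code)
        return recon, code

def sparse_ae_loss(x, recon, code, l1_weight=1e-3):
    # Reconstruction error plus an L1 penalty that encourages most code
    # units to be zero for any given frame.
    return nn.functional.mse_loss(recon, x) + l1_weight * code.abs().mean()

# Toy usage on random vectors standing in for acoustic feature frames.
model = SparseAutoencoder()
x = torch.randn(32, 440)
recon, code = model(x)
loss = sparse_ae_loss(x, recon, code)
loss.backward()
```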