Language-independent query-by-example spoken term detection (QbE-STD) is the problem of retrieving, from an audio archive, the documents that contain a spoken query provided by a user. It is usually cast as a hypothesis testing and pattern matching problem, and is also referred to as a ``zero-resource task'' since no task-specific training or lexical information is required to represent the spoken query. It thus enables multilingual search on unconstrained speech without requiring a full speech recognition system. State-of-the-art solutions typically rely on Dynamic Time Warping (DTW) based template matching using phone posterior features estimated by Deep Neural Networks (DNNs).
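For intuition, here is a minimal sketch of the DTW-based template matching described above: a subsequence DTW over frame-level distances between query and document phone posteriors. The negative-log inner-product distance and the simple step pattern are common illustrative choices, not necessarily the exact configuration evaluated in this thesis.

```python
import numpy as np

def subsequence_dtw_score(query, doc):
    """Detection score for a query against one audio document.

    query: (Tq, D) phone posteriors; doc: (Td, D) phone posteriors.
    Uses -log(inner product) as the frame-level distance, a common
    choice for posterior features (illustrative, not the thesis's).
    """
    dist = -np.log(np.clip(query @ doc.T, 1e-8, None))  # (Tq, Td)
    Tq, Td = dist.shape
    acc = np.full((Tq, Td), np.inf)
    acc[0, :] = dist[0, :]              # a match may start at any doc frame
    for i in range(1, Tq):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, Td):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # insertion
                                         acc[i, j - 1],      # deletion
                                         acc[i - 1, j - 1])  # match
    # The match may end at any doc frame; normalize by query length.
    return acc[-1].min() / Tq           # lower score = better match
```

Thresholding this score over all documents yields the detections; practical systems add path-length normalization and pruning on top of this basic recursion.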
In this thesis, we aim to exploit the low-dimensional subspace structure of the speech signal, which results from the constrained human speech production process. We exploit this structure to improve over the state of the art in two ways: (1) generating better phone or phonological posterior features, and (2) improving the matching algorithm. To enhance the phone posteriors, we learn the underlying phonetic subspaces in an unsupervised way, and use sub-phonetic attributes to extract the phonological components in a supervised manner. To improve the matching algorithm, we model the subspaces of the spoken query using its phone posterior representation. The resulting model is used to compute distances between the subspaces of the query and the phone posteriors of each audio document. These distances are then used to detect occurrences of the spoken query, while also regularizing the DTW to improve the detection scores.
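As a concrete illustration of the subspace-based matching, the sketch below estimates a linear subspace from the query's phone posteriors and scores each document frame by its distance (projection residual) to that subspace. An SVD basis is used here as a simple stand-in for the subspace modeling developed in the thesis; the function names and the dimension k are hypothetical.

```python
import numpy as np

def query_subspace(query_posteriors, k=5):
    """Orthonormal basis of a k-dimensional subspace spanned by the
    query's phone posteriors (SVD/PCA stand-in; k is illustrative)."""
    _, _, vt = np.linalg.svd(query_posteriors, full_matrices=False)
    return vt[:k].T                               # (D, k) basis

def frame_to_subspace_distances(basis, doc_posteriors):
    """Distance of each document frame to the query subspace, i.e.
    the norm of the residual after orthogonal projection."""
    proj = (doc_posteriors @ basis) @ basis.T     # (Td, D) projections
    return np.linalg.norm(doc_posteriors - proj, axis=1)  # (Td,)
```

Low residuals flag frames that lie close to the query's subspace; such per-frame distances can be thresholded directly or injected into the DTW cost as the regularization mentioned above.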
In addition to optimizing the different components of the state-of-the-art system, we propose a novel DNN-based QbE-STD system that provides an end-to-end learning framework. To that end, we replace DTW-based matching with a Convolutional Neural Network (CNN) architecture. We also learn multilingual features, aimed at obtaining a language-independent representation. Finally, we integrate the feature learning and the CNN-based matching to jointly train them and further improve QbE-STD performance.
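The following sketch shows what such a CNN-based matcher can look like: the query-vs-document frame similarity matrix is treated as a single-channel image and classified into query-occurs versus query-does-not-occur. It is written in PyTorch with purely illustrative layer sizes; it is not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class MatchCNN(nn.Module):
    """Toy CNN classifying a (1, Tq, Td) similarity matrix into
    query-occurs vs query-does-not-occur (layer sizes illustrative)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),   # handles variable Tq, Td
        )
        self.classifier = nn.Linear(32 * 8 * 8, 1)

    def forward(self, sim):                 # sim: (B, 1, Tq, Td)
        return self.classifier(self.features(sim).flatten(1))

# Example: score a 100-frame query against a 400-frame document.
logit = MatchCNN()(torch.randn(1, 1, 100, 400))
```

End-to-end training then amounts to backpropagating the occurrence loss through both this matcher and the feature extractor that produces the similarity matrix.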
We perform experiments on the challenging AMI meeting corpus (English), as well as on the multilingual Spoken Web Search 2013 and Query by Example Search on Speech Task 2014 benchmarks, and show significant improvements over a highly competitive state-of-the-art system.