Publication

Language Independent Query by Example Spoken Term Detection

Dhananjay Ram
2019
EPFL thesis
Abstract

Language independent query-by-example spoken term detection (QbE-STD) is the problem of retrieving, from an audio archive, the documents that contain a spoken query provided by a user. It is usually cast as a hypothesis testing and pattern matching problem, and is also referred to as a "zero-resource task" since no specific training or lexical information is required to represent the spoken query. QbE-STD thus enables multilingual search on unconstrained speech without requiring a full speech recognition system. State-of-the-art solutions typically rely on Dynamic Time Warping (DTW) based template matching using phone posterior features estimated by Deep Neural Networks (DNNs).
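To make this baseline concrete, below is a minimal sketch (Python/NumPy) of DTW-based subsequence matching between phone posterior sequences. It is an illustration only, not the exact system studied in the thesis: the cosine local distance, the frame counts, and the random Dirichlet vectors standing in for DNN posteriors are all assumptions.

```python
import numpy as np

def local_dist(q, d):
    """Cosine-based local distance between two posterior vectors (an assumption)."""
    return 1.0 - np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)

def dtw_score(query, doc):
    """Subsequence DTW: align the query anywhere inside the document.

    query: (Tq, P) phone posteriors of the spoken query
    doc:   (Td, P) phone posteriors of one audio document
    Returns the best (lowest) length-normalized alignment cost.
    """
    Tq, Td = len(query), len(doc)
    D = np.full((Tq + 1, Td + 1), np.inf)
    D[0, :] = 0.0  # the query may start at any document frame
    for i in range(1, Tq + 1):
        for j in range(1, Td + 1):
            c = local_dist(query[i - 1], doc[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[Tq, 1:].min() / Tq  # best end frame, normalized by query length

# Random Dirichlet vectors as stand-ins for DNN phone posteriors (50 phone classes)
rng = np.random.default_rng(0)
query = rng.dirichlet(np.ones(50), size=30)  # 30 query frames
doc = rng.dirichlet(np.ones(50), size=500)   # 500 document frames
print(dtw_score(query, doc))                 # lower score = better match
```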

In this thesis, we aim at exploiting the low-dimensional subspace structure of the speech signal, which results from the constrained human speech production process. We exploit this subspace structure to improve over the state-of-the-art in two ways: (1) generating better phone or phonological posterior features, and (2) improving the matching algorithm. To enhance phone posteriors, we learn the underlying phonetic subspaces in an unsupervised way, and use sub-phonetic attributes to extract the phonological components in a supervised manner. To improve the matching algorithm, we model the subspaces of the spoken query using its phone posterior representation. The resulting model is used to compute distances between the subspaces of the query and the phone posteriors of each audio document. These distances are then used to detect occurrences of the spoken query, while also regularizing the DTW to improve the detection scores.
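A minimal sketch of the subspace idea follows, assuming (hypothetically) that an SVD/PCA basis spans the query's posterior subspace and that the projection residual serves as the frame-to-subspace distance; the thesis's actual subspace model may differ.

```python
import numpy as np

def query_subspace(query_post, k=5):
    """Span a k-dimensional subspace of the query's phone posteriors via SVD.
    Both the use of SVD/PCA and the choice k=5 are illustrative assumptions."""
    mean = query_post.mean(axis=0)
    _, _, Vt = np.linalg.svd(query_post - mean, full_matrices=False)
    return Vt[:k], mean  # (k, P) orthonormal basis and the query mean

def subspace_distances(doc_post, basis, mean):
    """Distance of each document frame to the query subspace, measured as
    the norm of the residual after projecting onto the basis."""
    X = doc_post - mean
    proj = X @ basis.T @ basis               # projection onto the query subspace
    return np.linalg.norm(X - proj, axis=1)  # small = frame lies near the subspace

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(50), size=30)   # query posteriors
d = rng.dirichlet(np.ones(50), size=500)  # document posteriors
basis, mean = query_subspace(q, k=5)
frame_dist = subspace_distances(d, basis, mean)
# These per-frame distances could replace, or regularize, the DTW local costs.
```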

In addition to optimizing individual components of the state-of-the-art system, we propose a novel DNN-based QbE-STD system that provides an end-to-end learning framework. Towards that end, we replace the DTW-based matching with a Convolutional Neural Network (CNN) architecture. We also learn multilingual features, aimed at obtaining a language-independent representation. Finally, we integrate the feature learning and the CNN-based matching to jointly train and further improve the QbE-STD performance.
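One common way to realize CNN-based matching in QbE-STD is to treat the query-document frame-level similarity matrix as a single-channel image and classify it as match versus no-match. The PyTorch sketch below illustrates that idea; the layer sizes, pooling choices, and input shapes are arbitrary assumptions, not the architecture from the thesis.

```python
import torch
import torch.nn as nn

class MatchCNN(nn.Module):
    """Classify a query-document similarity matrix as match vs. no-match."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # makes variable-size inputs uniform
        )
        self.classifier = nn.Linear(32 * 4 * 4, 1)

    def forward(self, sim):                # sim: (batch, 1, Tq, Td)
        h = self.features(sim).flatten(1)
        return self.classifier(h)          # logit: does the query occur?

# Build a frame-level cosine-similarity "image" from posterior sequences
q = torch.rand(30, 50)                     # query posteriors (stand-ins)
d = torch.rand(500, 50)                    # document posteriors (stand-ins)
q = q / q.norm(dim=1, keepdim=True)
d = d / d.norm(dim=1, keepdim=True)
sim = (q @ d.T).unsqueeze(0).unsqueeze(0)  # shape (1, 1, 30, 500)
print(MatchCNN()(sim))                     # trained with match/no-match labels
```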

We perform experiments using the challenging AMI meeting corpus (English), as well as multilingual datasets such as Spoken Web Search 2013 and Query by Example Search on Speech Task 2014, and show significant improvements over a very competitive state-of-the-art system.

Related concepts (41)
Feature learning
In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task. Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process.
Convolutional neural network
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter (or kernel) optimization. Vanishing and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, in a fully connected layer each neuron would require 10,000 weights to process an image sized 100 × 100 pixels, whereas a convolutional layer shares a small set of kernel weights across all positions of the image.
Types of artificial neural networks
There are many types of artificial neural networks (ANN). Artificial neural networks are computational models inspired by biological neural networks, and are used to approximate functions that are generally unknown. Particularly, they are inspired by the behaviour of neurons and the electrical signals they convey between input (such as from the eyes or nerve endings in the hand), processing, and output from the brain (such as reacting to light, touch, or heat). The way neurons semantically communicate is an area of ongoing research.
Related publications (299)

Topics in statistical physics of high-dimensional machine learning

Hugo Chao Cui

In the past few years, Machine Learning (ML) techniques have ushered in a paradigm shift, allowing the harnessing of ever more abundant sources of data to automate complex tasks. The technical workhorse behind these important breakthroughs arguably lies in ...
EPFL, 2024

Performing and Detecting Backdoor Attacks on Face Recognition Algorithms

Alexander Carl Unnervik

The field of biometrics, and especially face recognition, has seen widespread adoption over the last few years, from access control on personal devices such as phones and laptops, to automated border controls such as in airports. The stakes are increasingly ...
EPFL, 2024

Driving and suppressing the human language network using large language models

Martin Schrimpf

Transformer models such as GPT generate human-like language and are predictive of human brain responses to language. Here, using functional-MRI-measured brain responses to 1,000 diverse sentences, we first show that a GPT-based encoding model can predict t ...
Berlin, 2024
