Publication

Trustworthy speaker recognition with minimal prior knowledge using neural networks

Hannah Muckenhirn
2019
EPFL thesis
Abstract

The performance of speaker recognition systems has improved considerably in the last decade. This is mainly due to the development of Gaussian mixture model-based systems and, in particular, to the use of i-vectors. These systems handle noise and channel mismatches relatively well and yield a low error rate when confronted with zero-effort impostors, i.e. impostors using their own voice but claiming to be someone else. However, speaker verification systems are vulnerable to more sophisticated attacks, called presentation or spoofing attacks. In that case, the impostors present a fake sample to the system, which can either be generated with a speech synthesis or voice conversion algorithm or be a previous recording of the target speaker. One way to make speaker recognition systems robust to this type of attack is to integrate a presentation attack detection system. Current methods for speaker recognition and presentation attack detection are largely based on short-term spectral processing, which has certain limitations. For instance, state-of-the-art speaker verification systems use cepstral features, which mainly capture vocal tract system characteristics, although voice source characteristics are also speaker discriminative. In the case of presentation attack detection, there is little prior knowledge that can guide us in differentiating bona fide samples from presentation attacks, as both are speech signals that carry the same high-level information, such as the message, the speaker identity and information about the environment. This thesis focuses on developing speaker verification and presentation attack detection systems that rely on minimal assumptions. To that end, inspired by recent advances in deep learning, we first develop speaker verification approaches where speaker discriminative information is learned from raw waveforms using convolutional neural networks (CNNs). We show that such approaches are capable of learning both voice source-related and vocal tract system-related speaker discriminative information and yield performance competitive with state-of-the-art systems, namely i-vector- and x-vector-based systems. We then develop two high-performing approaches for presentation attack detection: one based on long-term spectral statistics and the other based on raw speech modeling using CNNs. We show that these two approaches are complementary and make speaker verification systems robust to presentation attacks. Finally, we develop a visualization method inspired by the computer vision community to gain insight into the task-specific information captured by the CNNs from the raw speech signals.
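
The abstract does not define "long-term spectral statistics", so the following is only a minimal sketch under stated assumptions: it takes the statistics to be the per-frequency mean and standard deviation of the log-magnitude spectrum, computed over all frames of an utterance. The function name, the frame length, the hop size and the flooring constant are hypothetical choices for illustration, not values taken from the thesis, and the input is synthetic noise rather than real speech.

```python
import numpy as np


def long_term_spectral_statistics(signal, frame_len=256, hop=128, eps=1e-10):
    """Per-frequency mean and std of the log-magnitude spectrum over all frames.

    Assumption: this is one plausible reading of "long-term spectral
    statistics"; frame_len, hop and eps are illustrative values only.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // hop
    # Slice the utterance into overlapping frames: shape (n_frames, frame_len).
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]
    )
    # Magnitude spectrum of each frame (non-negative frequencies only).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Floor the magnitudes so that the logarithm stays finite.
    log_magnitude = np.log(np.maximum(magnitude, eps))
    # Long-term statistics: aggregate over the time (frame) axis.
    mean = log_magnitude.mean(axis=0)
    std = log_magnitude.std(axis=0)
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # synthetic stand-in for 1 s at 16 kHz
features = long_term_spectral_statistics(utterance)
print(features.shape)  # (258,): 129 means followed by 129 standard deviations
```

In this sketch each utterance is reduced to one fixed-length vector regardless of its duration. A separate classifier, not shown here, would then be trained on such vectors to separate bona fide samples from presentation attacks.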

Related concepts (35)
Speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Speaker recognition
Speaker recognition is the identification of a person from characteristics of voices. It is used to answer the question "Who is speaking?" The term voice recognition can refer to speaker recognition or speech recognition. Speaker verification (also called speaker authentication) contrasts with identification, and speaker recognition differs from speaker diarisation (recognizing when the same speaker is speaking).
Spoofing attack
In the context of information security, and especially network security, a spoofing attack is a situation in which a person or program successfully identifies as another by falsifying data, to gain an illegitimate advantage. IP address spoofing and ARP spoofing: Many of the protocols in the TCP/IP suite do not provide mechanisms for authenticating the source or destination of a message, leaving them vulnerable to spoofing attacks when extra precautions are not taken by applications to verify the identity of the sending or receiving host.
Related publications (87)

Adjustable deterministic pseudonymization of speech

Mathew Magimai Doss, Subrahmanya Pavankumar Dubagunta

While public speech resources become increasingly available, there is a growing interest to preserve the privacy of the speakers, through methods that anonymize the speaker information from speech while preserving the spoken linguistic content. In this pap ...
ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD, 2022

Novel Methods for Incorporating Prior Knowledge for Automatic Speech Assessment

Subrahmanya Pavankumar Dubagunta

Speech signal conveys several kinds of information such as a message, speaker identity, emotional state of the speaker and social state of the speaker. Automatic speech assessment is a broad area that refers to using automatic methods to predict human judg ...
EPFL, 2021

Utterance Verification-Based Dysarthric Speech Intelligibility Assessment Using Phonetic Posterior Features

Mathew Magimai Doss, Julian David Fritsch

In the literature, the task of dysarthric speech intelligibility assessment has been approached through development of different low-level feature representations, subspace modeling, phone confidence estimation or measurement of automatic speech recognitio ...
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2021
Related MOOCs (14)
Neuronal Dynamics 2- Computational Neuroscience: Neuronal Dynamics of Cognition
This course explains the mathematical and computational models that are used in the field of theoretical neuroscience to analyze the collective dynamics of thousands of interacting neurons.
Neuronal Dynamics - Computational Neuroscience of Single Neurons
The activity of neurons in the brain and the code used by these neurons is described by mathematical neuron models at different levels of detail.
