Publication

SERAB: A Multi-Lingual Benchmark for Speech Emotion Recognition

Self-attention for Speech Emotion Recognition

Philip Neil Garner, Lorenzo Tarantino

Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent-based and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. ...

2019

Temporal Spiking Recurrent Neural Network for Action Recognition

Wei Wang, Siyuan Hao

In this paper, we propose a novel temporal spiking recurrent neural network (TSRNN) to perform robust action recognition in videos. The proposed TSRNN employs a novel spiking architecture which utilizes the local discriminative features from high-confidenc ...

2019

Segment-level training of ANNs based on acoustic confidence measures for hybrid HMM/ANN Speech Recognition

Subrahmanya Pavankumar Dubagunta

We show that confidence measures estimated from local posterior probabilities can serve as objective functions for training ANNs in hybrid HMM based speech recognition systems. This leads to a segment-level training paradigm that overcomes the limitation o ...

IEEE, 2019

Understanding and Visualizing Raw Waveform-based CNNs

Sébastien Marcel, Hannah Muckenhirn

Modeling directly raw waveforms through neural networks for speech processing is gaining more and more attention. Despite its varied success, a question that remains is: what kind of information are such neural networks capturing or learning for different ...

2019

Context-Aware Attention Mechanism for Speech Emotion Recognition

Philip Neil Garner, Gaétan Ramet

In this work, we study the use of attention mechanisms to enhance the performance of the state-of-the-art deep learning model in Speech Emotion Recognition (SER). We introduce a new Long Short-Term Memory (LSTM)-based neural network attention model which i ...

IEEE, 2018

Evolution of Neural Network Architectures for Speech Recognition

Hervé Bourlard

Over these last few years, the use of Artificial Neural Networks (ANNs), now often referred to as deep learning or Deep Neural Networks (DNNs), has significantly reshaped research and development in a variety of signal and information processing tasks. Whi ...

ISCA-INT SPEECH COMMUNICATION ASSOC, 2018

Learning embeddings: efficient algorithms and applications

Cijo Jose

Learning to embed data into a space where similar points are together and dissimilar points are far apart is a challenging machine learning problem. In this dissertation we study two learning scenarios that arise in the context of learning embeddings and o ...

EPFL, 2018

Phonetic aware techniques for Speaker Verification

Subhadeep Dey

The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on "identity-vectors" (i-vectors) and deep neural network (DNN), by exploiting diverse (phonetic) information extracted using variou ...

EPFL, 2018

Joint Localization and Classification of Multiple Sound Sources Using a Multi-task Neural Network

Jean-Marc Odobez, Petr Motlicek, Weipeng He

We propose a novel multi-task neural network-based approach for joint sound source localization and speech/non-speech classification in noisy environments. The network takes raw short time Fourier transform as input and outputs the likelihood values for th ...

ISCA-INT SPEECH COMMUNICATION ASSOC, 2018