Multimodal feature extraction and fusion for audio-visual speech recognition
Class-specific classifiers for audio, visual and audio-visual speech recognition systems are developed and compared with traditional classifiers. We use state-of-the-art feature extraction methods and show the benefits of a class-specific classifier on ea ...
In this paper, we address the problem of the recognition of isolated, complex, dynamic hand gestures. The goal of this paper is to provide an empirical comparison of two state-of-the-art techniques for temporal event modeling combined with specific feature ...
We present a method for multimodal fusion based on the estimated reliability of each individual modality. Our method uses an information theoretic measure, the entropy derived from the state probability distribution for each stream, as an estimate of relia ...
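The entropy-based reliability idea in this abstract can be sketched briefly: a peaked state posterior (low entropy) suggests a confident, reliable stream, while a flat one (high entropy) suggests the opposite. The inverse-entropy weighting and log-linear combination below are illustrative assumptions, not necessarily the exact mapping used in the paper.

```python
import numpy as np

def stream_entropy(p):
    """Shannon entropy (in nats) of a state posterior distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def reliability_weights(audio_post, visual_post):
    """Map each stream's entropy to a fusion weight: lower entropy
    (a more peaked posterior) is treated as more reliable.
    Inverse-entropy weighting is an assumed, illustrative choice."""
    w_a = 1.0 / (stream_entropy(audio_post) + 1e-12)
    w_v = 1.0 / (stream_entropy(visual_post) + 1e-12)
    s = w_a + w_v
    return w_a / s, w_v / s

def fuse(audio_post, visual_post):
    """Weighted log-linear combination of the two stream posteriors,
    renormalized to a proper distribution."""
    w_a, w_v = reliability_weights(audio_post, visual_post)
    fused = np.asarray(audio_post) ** w_a * np.asarray(visual_post) ** w_v
    return fused / fused.sum()
```

With a confident audio frame (e.g. `[0.9, 0.05, 0.05]`) and an uncertain visual frame (e.g. `[0.4, 0.3, 0.3]`), the audio stream receives the larger weight, so the fused posterior leans toward the audio hypothesis.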
In this thesis, we investigate the use of posterior probabilities of sub-word units directly as input features for automatic speech recognition (ASR). These posteriors, estimated from data-driven methods, display some favourable properties such as increase ...
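Using sub-word posteriors directly as ASR features is commonly done by log-compressing and decorrelating them before handing them to a Gaussian-based recognizer (the "tandem" style of processing). The sketch below shows that generic pipeline under those assumptions; it is not the thesis's specific method.

```python
import numpy as np

def posterior_features(posteriors, n_components=None):
    """Turn frame-level sub-word posteriors into ASR input features:
    log-compress (to make the distribution more Gaussian-like), then
    decorrelate with PCA. Assumed illustrative pipeline."""
    X = np.log(np.asarray(posteriors, dtype=float) + 1e-10)
    X = X - X.mean(axis=0)          # center per dimension
    # PCA via SVD of the centered log-posteriors.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = n_components or X.shape[1]
    return X @ Vt[:k].T             # decorrelated projection
```

The projected dimensions are mutually uncorrelated, which suits diagonal-covariance Gaussian models.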
Person identification using audio (speech) and visual (facial appearance, static or dynamic) modalities, either independently or jointly, is a thoroughly investigated problem in pattern recognition. In this work, we explore a novel task: person identifica ...
The problem of feature selection has been thoroughly analyzed in the context of pattern classification, with the purpose of avoiding the curse of dimensionality. However, in the context of multimodal signal processing, this problem has been studied less. O ...
Speaker diarization is originally defined as the task of determining “who spoke when” given an audio track and no other prior knowledge of any kind. The following article shows a multi-modal approach where we improve a state-of-the-art speaker diarizati ...
The use of omnidirectional cameras for videoconferencing promises to simplify the hardware setup necessary for large groups of participants. We investigate the use of a multimodal speaker detection algorithm on audio-visual sequences captured with such a c ...
We present a method for dynamically integrating audio-visual information for speech recognition, based on the estimated reliability of the audio and visual streams. Our method uses an information theoretic measure, the entropy derived from the state probab ...