Emergent leaders through looking and speaking: from audio-visual data to multimodal recognition
The recognition of speech in meetings poses a number of challenges to current Automatic Speech Recognition (ASR) techniques. Meetings typically take place in rooms with non-ideal acoustic conditions and significant background noise, and may contain large s ...
We address the problem of recognizing, in dynamic meetings in which people do not remain seated all the time, the visual focus of attention (VFOA) of seated people from their head pose and contextual activity cues. We propose a model that comprises the VFO ...
The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. It is being created in the context of a project that is developing meeting browsing technology and will eventually be released publicly. Some of the meetings it ...
A quantitative measure of relevance is proposed for the task of constructing visual feature sets which are at the same time relevant and compact. A feature's relevance is given by the amount of information that it contains about the problem, while compactn ...
Humans perceive their surrounding environment in a multimodal manner by using multi-sensory inputs combined in a coordinated way. Various studies in psychology and cognitive science indicate the multimodal nature of human speech production and perception. ...
We present a method for dynamically integrating audio-visual information for speech recognition, based on the estimated reliability of the audio and visual streams. Our method uses an information theoretic measure, the entropy derived from the state probab ...
Visual attention models mimic the ability of a visual system to detect potentially relevant parts of a scene. This process of attentional selection is a prerequisite for higher level tasks such as object recognition. Given the high relevance of temporal a ...
Visual attention, defined as the ability of a biological or artificial vision system to rapidly detect potentially relevant parts of a visual scene, provides a general purpose solution for low level feature detection in a vision architecture. Well consider ...