This work investigates the Multi-Stream approach to automatic speech recognition as a framework for Audio-Visual data fusion and speech recognition. The method offers several potential advantages for this task. In particular, it allows synchronous decoding of continuous speech while still tolerating some asynchrony between the visual and acoustic information streams. First, the Multi-Stream formalism is briefly recalled. Then, building on the motivations for the Multi-Stream approach, experiments on the M2VTS multimodal database are presented and discussed. To our knowledge, these are the first experiments on multi-speaker continuous Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream approach can yield improved Audio-Visual speech recognition performance both when the acoustic signal is corrupted by noise and for clean speech.
Silvestro Micera, Simone Romeni, Laura Toni, Fiorenzo Artoni
Petr Motlicek, Hynek Hermansky, Sriram Ganapathy, Amrutha Prasad