Multimodal signal processing analyzes a physical phenomenon through several types of measurements, or modalities. This yields higher-quality and more reliable information than can be obtained from single-modality signals. The advantage is two-fold. First, because the modalities are usually complementary, the result of multimodal processing is more informative than that of any single modality. This is true in all application domains, be it human-machine interaction, multimodal identification or multimodal image processing. Second, because modalities are not always reliable, when one modality becomes corrupted the missing information can be extracted from the others.

There are two essential challenges in multimodal signal processing. First, the features extracted from each modality need to be as relevant and as few as possible. Because multimodal systems process more than one modality, they run into errors caused by the curse of dimensionality much more easily than mono-modal ones. The curse of dimensionality refers to the fact that the number of evenly distributed samples required to cover a region of space grows exponentially with the dimensionality of the space. This has important implications for classification, since accurate models can only be obtained if an adequate number of samples is available, and this required number grows with the dimensionality of the features. Dimensionality reduction is thus a necessary step in any application dealing with complex signals, and it is achieved through feature selection, transforms or a combination of the two. The second essential challenge is multimodal integration. Since the signals involved do not necessarily have the same data rate, range or even dimensionality, combining information coming from such different sources is not straightforward.
Multimodal integration can be done at different levels, from the basic signal level, where the signals themselves are combined if they are compatible, up to the highest decision level, where only the individual decisions taken based on each signal are combined. Ideally, the fusion method should allow temporal variations in the relative importance of the two streams, to account for possible changes in their quality. However, this can only be done with methods operating at a high decision level. The aim of this thesis is to offer solutions to both of these challenges, in the context of audio-visual speech recognition and speaker localization, two applications from the field of human-machine interaction. Audio-visual speech recognition aims to improve the accuracy of speech recognizers by augmenting the audio with information extracted from the video, in particular the movement of the speaker's lips. This works especially well when the audio is corrupted, where it leads to significant gains in accuracy.
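The idea of decision-level fusion with a time-varying stream weight can be illustrated with a minimal Python sketch. It is not the method developed in the thesis. It assumes that each stream provides per-class log-probabilities for a frame and that an audio reliability weight between 0 and 1 is available from some external estimate. The function names `fuse_frame` and `decide`, the weighting rule and the example numbers are all hypothetical and chosen for illustration only.

```python
import math
from typing import List, Sequence


def fuse_frame(audio_logp: Sequence[float],
               video_logp: Sequence[float],
               audio_weight: float) -> List[float]:
    """Combine per-class log-probabilities of two streams for one frame.

    Assumed rule (illustrative): a weighted sum of log-probabilities,
    where audio_weight in [0, 1] is the relative importance of the audio
    stream and the video stream receives the remaining weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_logp) != len(video_logp):
        raise ValueError("both streams must score the same classes")
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio_logp, video_logp)]


def decide(scores: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical two-class frame: audio favors class 0, video favors class 1.
audio = [math.log(0.7), math.log(0.3)]
video = [math.log(0.2), math.log(0.8)]

# Clean audio: the audio stream dominates the decision.
clean_decision = decide(fuse_frame(audio, video, audio_weight=0.9))

# Corrupted audio: the weight shifts toward the video stream.
noisy_decision = decide(fuse_frame(audio, video, audio_weight=0.1))

print(clean_decision, noisy_decision)
```

Because the weight is passed per frame, it can change over time as the estimated quality of each stream changes, which is the property described above for high-level fusion methods.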