This paper describes a multimodal approach to speaker verification. The system consists of two classifiers, one using visual features and the other using acoustic features. A lip tracker extracts visual information from the speaking face, providing shape and intensity features. We describe an approach for normalizing the scores of the different modalities and mapping them onto a common confidence interval. We also describe a novel method for integrating the scores of multiple classifiers. Verification experiments are reported for the individual modalities and for the combined classifier. The integrated system outperformed each sub-system and reduced the false acceptance rate of the acoustic sub-system from 2.3% to 0.5%.
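As an illustration of mapping two modalities onto a common confidence interval and combining them, the following is a minimal sketch and not the method of the paper. The sigmoid mapping, the weighted-sum fusion rule, the function names, and all numeric values (thresholds, scales, weight) are assumptions chosen only for the example.

```python
import math


def to_confidence(score, threshold, scale):
    """Map a raw classifier score onto (0, 1).

    Assumed mapping: a sigmoid centred on the classifier's own decision
    threshold, so a score equal to the threshold maps to 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(score - threshold) / scale))


def fuse(acoustic_conf, visual_conf, acoustic_weight=0.7):
    """Combine two confidences with a weighted sum (hypothetical weight)."""
    return acoustic_weight * acoustic_conf + (1.0 - acoustic_weight) * visual_conf


def verify(acoustic_score, visual_score):
    """Accept the identity claim if the fused confidence reaches 0.5."""
    # Thresholds and scales below are made-up values for illustration.
    a = to_confidence(acoustic_score, threshold=-2.0, scale=0.5)
    v = to_confidence(visual_score, threshold=10.0, scale=3.0)
    return fuse(a, v) >= 0.5


if __name__ == "__main__":
    # Both modalities score above their thresholds: the claim is accepted.
    print(verify(acoustic_score=-1.0, visual_score=14.0))
    # Both modalities score below their thresholds: the claim is rejected.
    print(verify(acoustic_score=-4.0, visual_score=5.0))
```

Centring each mapping on that classifier's own threshold makes 0.5 mean "undecided" for both modalities, so scores on very different raw scales become comparable before fusion.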