Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called "AV16.3", along with a method for 3-D location annotation based on calibrated cameras.
"16.3" stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
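For readers unfamiliar with annotation from calibrated cameras, the sketch below illustrates the standard underlying technique: linear (DLT) triangulation of a 3-D point from its 2-D image coordinates in two or more views with known 3x4 projection matrices. This is a minimal, generic sketch assuming NumPy; the function name `triangulate` and the exact formulation are illustrative assumptions, not necessarily the annotation method used for AV16.3.

```python
import numpy as np

def triangulate(P_list, uv_list):
    """Linear (DLT) triangulation of one 3-D point from >= 2 calibrated views.

    P_list  : list of 3x4 camera projection matrices (from calibration).
    uv_list : matching list of (u, v) pixel coordinates of the point.
    Returns the 3-D point in world coordinates.
    (Illustrative sketch, not the paper's implementation.)
    """
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        # Each view gives two linear constraints on the homogeneous point X:
        #   u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

In such a setup, an annotator typically marks the speaker's location (e.g. the mouth) in each camera's image plane; given at least two views and the calibration matrices, the 3-D location follows by triangulation as above.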