In this paper, we propose a new approach for automatic audio-based temporal alignment, with confidence estimation, of audio-visual data recorded by different cameras, camcorders, or mobile phones during social events. All recordings are temporally aligned, based on ASR-related features, to a common master track recorded by a reference camera, and the confidence of each alignment is estimated. The core of the algorithm is perceptual time-frequency analysis with a precision of 10 ms. On a real-life dataset, the method aligns 99% of cases correctly, surpassing the performance of cross-correlation while keeping lower system requirements.
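As a rough illustration of the general idea (not the paper's actual algorithm, which uses ASR-related perceptual time-frequency features), the following sketch aligns a secondary track to a master track by cross-correlating frame-level log-energy features computed at a 10 ms hop, and derives a simple confidence score from the ratio of the best correlation peak to the second best. All names, the feature choice, and the confidence heuristic are illustrative assumptions.

```python
# Illustrative sketch only: feature-based alignment at 10 ms resolution
# with a peak-ratio confidence estimate. Not the paper's method.
import math
import random

SAMPLE_RATE = 16000        # assumed sample rate (Hz)
HOP = SAMPLE_RATE // 100   # 10 ms hop -> 160 samples

def frame_energies(signal, hop=HOP):
    """Log-energy per 10 ms frame (a crude stand-in for
    perceptual time-frequency features)."""
    return [
        math.log(1e-9 + sum(s * s for s in signal[i:i + hop]))
        for i in range(0, len(signal) - hop + 1, hop)
    ]

def align(master, slave, max_lag_frames=50):
    """Return (offset_seconds, confidence).

    A positive offset means the slave track lags the master.
    Confidence is the ratio of the best normalized-correlation
    score to the second best (an illustrative heuristic).
    """
    fm, fs = frame_energies(master), frame_energies(slave)
    scores = {}
    for lag in range(-max_lag_frames, max_lag_frames + 1):
        # Pair slave frame i with master frame i - lag where both exist.
        pairs = [
            (fm[i - lag], fs[i])
            for i in range(len(fs))
            if 0 <= i - lag < len(fm)
        ]
        if len(pairs) < 10:
            continue  # too little overlap for a reliable score
        mx = sum(a for a, _ in pairs) / len(pairs)
        my = sum(b for _, b in pairs) / len(pairs)
        num = sum((a - mx) * (b - my) for a, b in pairs)
        den = math.sqrt(
            sum((a - mx) ** 2 for a, _ in pairs)
            * sum((b - my) ** 2 for _, b in pairs)
        ) or 1e-12
        scores[lag] = num / den
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lag, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = best / max(second, 1e-12)
    return best_lag * HOP / SAMPLE_RATE, confidence

if __name__ == "__main__":
    # Synthetic check: a slave track delayed by 300 ms should be
    # recovered with an offset of 0.30 s.
    random.seed(0)
    master = [random.uniform(-1, 1) for _ in range(SAMPLE_RATE)]
    delay = 30 * HOP
    slave = [0.0] * delay + master[:-delay]
    offset, conf = align(master, slave)
    print(f"estimated offset: {offset:.2f} s, confidence: {conf:.2f}")
```

In practice the paper's approach replaces the log-energy features with perceptual time-frequency features, which are far more discriminative for real recordings with differing microphones and noise conditions.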