Accurate detection and segmentation of spontaneous multi-party speech is crucial for a variety of applications, including speech acquisition and recognition, as well as higher-level event recognition. However, spontaneous speech is highly sporadic, which makes this task difficult, and multi-party speech additionally contains many overlaps between speakers. We propose to address this problem as a tracking task, using location cues only. To cope with this sporadic behavior, we propose a novel, generic, short-term clustering algorithm that can track multiple objects at a low computational cost. The proposed approach is online, fully deterministic, and can run in real time. In an application to real meeting data, the algorithm produces high-precision speech segmentation.
Zhen Wei, Zhiye Wang, Peixia Li
David Atienza Alonso, Miguel Peon Quiros, Pasquale Davide Schiavone, Rubén Rodríguez Álvarez, Denisa-Andreea Constantinescu, Dimitrios Samakovlis, Stefano Albini