In this report, we build on our previous work on speaker clustering, in which the number of speakers and the segmentation boundaries are not known a priori. We employ an ergodic HMM with a minimum-duration topology for this purpose. Starting from a large number of clusters, we merge one pair of clusters in every iteration. We propose a new criterion for merging two clusters that ensures an increase in the likelihood of the data. The merging is done in such a way that the total number of parameters needed to model all the clusters remains the same. Thus, the system finally achieves maximum likelihood (which was not the case in our previous work) with a constant number of parameters. The merging process is repeated until no candidates for merging remain. The efficiency and advantages of using only highly voiced frames were reported in our previous work, and we use the same features here. The system was evaluated on the Hub-4 1996 evaluation set. We report improvements over the previous work and show that the system converges to the right number of clusters when the number of speakers is limited.
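The merging loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the report's implementation: clusters are represented as sets of segment ids, and `gain` is a hypothetical callback standing in for the merging criterion, assumed to return the change in data log-likelihood if two clusters were merged into one model that keeps the combined parameter count of both. The HMM, the features, and any retraining between merges are left out. The abstract does not say which candidate pair is merged in an iteration; picking the pair with the largest gain is an assumption of this sketch.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical representation: a cluster is a set of segment ids.
Cluster = FrozenSet[int]


def greedy_merge(
    initial: Iterable[Cluster],
    gain: Callable[[Cluster, Cluster], float],
) -> List[Cluster]:
    """Merge one pair of clusters per iteration until no merge raises the likelihood.

    `gain(a, b)` is an assumed callback returning the change in log-likelihood
    from merging clusters a and b; only strictly positive gains are candidates.
    """
    clusters = list(initial)
    while len(clusters) > 1:
        best_gain = 0.0
        best_pair = None
        # Examine every pair and remember the one with the largest positive gain.
        for i, j in combinations(range(len(clusters)), 2):
            g = gain(clusters[i], clusters[j])
            if g > best_gain:
                best_gain, best_pair = g, (i, j)
        if best_pair is None:
            break  # no candidates left for merging
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Because a merge is accepted only when the gain is positive, the loop stops by itself, and the number of clusters left at that point is the system's estimate of the number of speakers.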