Do Hoang Nam Le
Multimedia databases are growing rapidly in the digital age. To increase the value of these data and to enhance the user experience, the videos they contain need to be made searchable through automatic indexing. Because people appearing and talking in videos are often of high interest to end users, indexes that represent the location and identity of people in the archive are indispensable for video search and browsing tools. At the same time, multimedia videos contain rich data about people in both the visual and auditory domains, which offers potential for multimodal learning in the task of human identification. Hence, the main theme of this thesis is algorithms that create indexes and exploit audio-visual correspondence in large multimedia corpora based on person identities.
First, this thesis deals with algorithms to create indexes through person discovery in videos. This involves several components: face and speaker diarization, face-voice association, and person naming. To obtain face clusters, we propose a novel face tracking approach that leverages face detectors within a tracking-by-detection framework relying on long-term, time-interval-sensitive association costs. We also use shot context to further accelerate and improve face clustering. Face clusters are then associated with speaker clusters using dubbing and talking detection, in which a multimodal framework is introduced to represent the temporal relationship between the auditory and visual streams. We also improve speaker embeddings for recognition and clustering by using a regularizer called intra-class loss.
In the second half, the thesis focuses on multimodal learning with face-voice data. Here, we aim to answer two research questions. First, can one improve a voice embedding using knowledge transferred from a face representation? We investigate several transfer learning approaches that constrain the target voice embedding space to share latent attributes with the source face embedding space. The crossmodal constraints act as regularizers that help voice models, especially in the low-data setting. Second, can face clusters be used as training labels to learn a speaker embedding? To answer this, we explore the tolerance of embedding losses under label uncertainty. From the risk minimization perspective, we obtain analytical results that provide heuristics for strategies to improve tolerance against label noise. We apply these findings to our task of learning speaker embeddings using face clusters as labels. While the experimental results agree with the analytical heuristics, there is still a large performance gap between the supervised and the weakly supervised models, which requires further investigation.
EPFL