The goal of this thesis is to improve current state-of-the-art techniques in speaker verification
(SV), typically based on âidentity-vectorsâ (i-vectors) and deep neural network (DNN), by exploiting diverse (phonetic) information extracted using various techniques such as automatic
speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by âuniversal background modelâ. The speaker-specific subspace
depends on the speakerâs voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis
technique to obtain low dimensional speaker-specific representation. Furthermore, DNN output is also employed in a conventional i-vector framework to model phonetic information
embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics.
More specifically, the techniques proposed in this thesis are applied to various SV tasks,
namely, text-independent and text-dependent SV. For text-independent SV task, several ASR
systems are developed and applied to compute phonetic posterior probabilities, subsequently
exploited to enhance the speaker-specific information included in i-vectors. These approaches
are then extended for text-dependent SV task, exploiting temporal information in a principled
way, i.e., by using dynamic time warping applied on speaker informative vectors.
Finally, as opposed to train DNN with phonetic information, DNN is trained in an end-to-end
fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of
mapping a variable length speech segment to a fixed dimensional speaker vector by estimating
the mean of hidden representations in DNN structure. We improve upon this technique by
computing a distance function between two utterances which takes into account common
phonetic units. The whole network is optimized by employing a triplet-loss objective function.
The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010
and RSR2015. Significant improvements are observed over the baseline systems on both the
text-dependent and text-independent SV tasks by applying phonetic knowledge.EPFL