End-to-end text-dependent speaker verification using novel distance measures

This paper explores novel ideas for building an end-to-end deep neural network (DNN) based text-dependent speaker verification (SV) system. The baseline approach maps a variable-length speech segment to a fixed-dimensional speaker vector by taking the mean of the hidden representations in the DNN. The distance between two utterances is then computed as the L2 distance between their vectors. This approach performs worse than conventional Gaussian Mixture Model-Universal Background Model (GMM-UBM) based SV on a publicly available corpus. We believe the degraded performance is due to the averaging operation, which may not capture the phonetic information of an utterance. Recent studies indicate that techniques exploiting phonetic information in addition to speaker information are beneficial for this task. This paper therefore proposes to incorporate the content information of the speech signal by computing the distance function over linguistic units that co-occur in the enrollment and test data. The whole network is optimized end to end with a triplet-loss objective to estimate SV scores. Experiments on the RSR2015 dataset indicate that the proposed approach outperforms the GMM-UBM system by 48% and 36% relative in equal error rate for the fixed-phrase and random-digit conditions, respectively.
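The contrast between the baseline distance and a content-aware distance can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the frame-level unit labels, the choice to average per-unit L2 distances over shared units, and the margin value are all assumptions made for illustration. Frame vectors stand in for the DNN hidden representations.

```python
import math
from collections import defaultdict


def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def l2_distance(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def baseline_distance(enroll_frames, test_frames):
    """Baseline: mean-pool each utterance, then take the L2 distance."""
    return l2_distance(mean_pool(enroll_frames), mean_pool(test_frames))


def unit_means(frames, units):
    """Mean vector per linguistic unit (unit labels are hypothetical inputs)."""
    groups = defaultdict(list)
    for frame, unit in zip(frames, units):
        groups[unit].append(frame)
    return {unit: mean_pool(fs) for unit, fs in groups.items()}


def content_aware_distance(enroll_frames, enroll_units, test_frames, test_units):
    """Assumed variant: average L2 distance over units present in both utterances."""
    enroll = unit_means(enroll_frames, enroll_units)
    test = unit_means(test_frames, test_units)
    shared = sorted(set(enroll) & set(test))
    if not shared:
        raise ValueError("no co-occurring linguistic units")
    return sum(l2_distance(enroll[u], test[u]) for u in shared) / len(shared)


def triplet_loss(d_pos, d_neg, margin=1.0):
    """Hinge-style triplet loss on two distances; the margin is an assumed value."""
    return max(0.0, d_pos - d_neg + margin)


# Toy data: the utterance-level means coincide, but the per-unit vectors differ.
enroll_frames = [[0.0, 0.0], [2.0, 0.0]]
test_frames = [[1.0, 0.0], [1.0, 0.0]]
units = ["a", "b"]

d_base = baseline_distance(enroll_frames, test_frames)
d_content = content_aware_distance(enroll_frames, units, test_frames, units)
```

In this toy case the utterance-level averages are identical, so the baseline distance cannot separate the two utterances, while the per-unit comparison still registers a difference. That is the intuition behind the claim that averaging may discard phonetic information.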
