Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-specific Scaling

In communication systems, it is crucial to estimate the perceived quality of audio and speech. For many years the industry standards have been PESQ, 3QUEST, and POLQA, all of which are intrusive methods: they require the clean reference signal. This restricts their use in real-world conditions, where the clean reference is often unavailable. In this work, we develop a new non-intrusive metric based on crowd-sourced data. We build a new speech dataset by combining publicly available speech, noises, and reverberations. We then follow the ITU P.808 recommendation to label the dataset with mean opinion scores (MOS). Finally, we train a deep neural network to estimate the MOS from the speech data in a non-intrusive way. We propose two novelties. First, we explore transfer learning by pre-training a model on a larger set of POLQA scores and fine-tuning it on the smaller (and thus cheaper) human-labeled set. Second, we apply subject-specific scaling to the MOS scores to adjust for the raters' different subjective scales. Our model yields better accuracy than PESQ, POLQA, and other non-intrusive methods when evaluated on the independent VCTK test set. We also report misleading POLQA scores for reverberant speech.
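The abstract does not give the formula used for subject-specific scaling. As a rough illustration of the idea only, the sketch below assumes one common choice: each rater's scores are standardized with that rater's own mean and standard deviation, then mapped back onto the global score scale before averaging per clip. The function name `subject_scaled_mos` and the `(rater_id, clip_id, score)` input layout are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def subject_scaled_mos(ratings):
    """Illustrative per-rater scaling (an assumption, not the paper's exact method).

    ratings: list of (rater_id, clip_id, score) tuples.
    Returns {clip_id: MOS computed from rater-normalized scores}.
    """
    # Collect every score each rater gave.
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)

    # Global statistics define the common scale that all raters are mapped onto.
    all_scores = [score for _, _, score in ratings]
    g_mean, g_std = mean(all_scores), pstdev(all_scores)

    # Per-rater mean and (population) standard deviation.
    stats = {r: (mean(v), pstdev(v)) for r, v in by_rater.items()}

    by_clip = defaultdict(list)
    for rater, clip, score in ratings:
        r_mean, r_std = stats[rater]
        # A rater who always gives the same score has zero spread;
        # treat every such rating as sitting at the rater's mean.
        z = (score - r_mean) / r_std if r_std > 0 else 0.0
        by_clip[clip].append(g_mean + g_std * z)

    # MOS per clip is the average of the rescaled scores.
    return {clip: mean(v) for clip, v in by_clip.items()}


# Rater "a" scores generously and rater "b" harshly, but both rank c1 above c2.
example = [("a", "c1", 5), ("a", "c2", 3), ("b", "c1", 3), ("b", "c2", 1)]
scaled = subject_scaled_mos(example)
```

In this toy example the two raters disagree on absolute scores but agree on the ordering of the clips; after scaling, their contributions to each clip coincide, so the offset between a generous and a harsh rater no longer adds noise to the MOS labels.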

Fundamental Limits in Statistical Learning Problems: Block Models and Neural Networks

Deep Learning Generalization with Limited and Noisy Labels

On Breathing Pattern Information in Synthetic Speech
