A speech signal conveys several kinds of information, such as the message, the speaker's identity, and the speaker's emotional and social state. Automatic speech assessment is a broad area concerned with using automatic methods to predict human judgements about the different kinds of information conveyed in speech, such as the intelligibility of the spoken message and the dialect and fluency of the speaker. Unlike other areas of speech technology, such as automatic speech recognition, text-to-speech synthesis and automatic speaker recognition, automatic speech assessment is an emerging direction of research. One of the challenges in this field is that no single method or framework scales across diverse speech assessment tasks. This thesis therefore takes a broader outlook and focuses on incorporating prior knowledge into diverse data-driven speech assessment problems.

First, we focus on the development of end-to-end acoustic modelling methods for non-verbal cue-based speech assessment. More precisely, we develop neural network-based methods that integrate prior knowledge about speech production in order to learn to assess speech from the raw waveform. We validate the developed methods through investigations on several speech assessment tasks, viz. dialect identification, depression detection and speech fluency rating prediction.

Second, we focus on advancing a recently proposed phone posterior feature-based intelligibility estimation technique. Specifically, to enhance phone posterior probability estimation, we propose two novel approaches that incorporate linguistic segment-level knowledge during the training of neural networks through the estimation of confidence measures.
We validate the two proposed approaches through automatic speech recognition and dysarthric speech intelligibility assessment studies.

Finally, in the context of privacy preservation, we develop a signal processing-based speech pseudonymization approach that alters voice source information and vocal tract system information based on prior knowledge to obfuscate the speaker identity while retaining intelligibility, i.e. the phones and words remain recognizable. We validate the proposed pseudonymization approach through listening experiments and automatic evaluations.