The problem of style transfer consists in transferring the style from one signal to another while preserving the latter’s content. This project explores the application of style transfer techniques to speech signals. In particular, such techniques are used to address Voice Conversion (VC). This problem can be formulated within the style transfer framework and consists of changing the speaker identity (style) of a speech signal at will while preserving the same linguistic information (content). Style transfer is an inherently ill-posed problem; i.e., there is no unique solution. Moreover, there are no standardized objective measures to evaluate the results. The effect of this is twofold. Firstly, the lack of such metrics hinders the training process. Secondly, it is hard to benchmark different methods. The first problem is tackled by using an AutoEncoder (AE) architecture. The raw speech is mapped to a lower-dimensional latent space where linguistic and speaker content are separated. During training, the model learns to reconstruct the original raw speech from this representation. When performing VC, the latent representation is modified to match the target speaker. This work presents two variants of this approach, named FastVC and PhonetVC. The second problem, comparison with the state of the art, is addressed through the large-scale crowd-sourced perceptual evaluations performed in the Voice Conversion Challenge. The 2020 edition of this challenge centers on non-parallel VC. In particular, a variant of FastVC was submitted for the cross-lingual task, where the target and source speakers speak different languages. FastVC outperformed the VC Challenge baselines and ranked in the top half of all participants in the Mean Opinion Score (MOS) quality results. The unsupervised representations found with FastVC are shown to be speaker-independent and easily mapped to the human phoneme alphabet.
The analysis of such representations confirms that encoder-decoder architectures allow disentangling the style and content of a speech signal. Overall, FastVC offers high-quality results for the task of VC while providing speedy conversions.
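The autoencoder idea described above can be illustrated with a toy sketch: encode speech into a content code, then decode it with a different speaker code. This is not the FastVC or PhonetVC architecture. The layer sizes, the random untrained weights, the use of feature frames instead of raw waveforms, and the names `encode`, `decode`, and `convert` are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
N_FEATS = 16     # features per speech frame
N_CONTENT = 4    # size of the content (linguistic) code
N_SPEAKERS = 3   # number of known speakers
N_SPK_EMB = 2    # size of the speaker (style) code

# Random, untrained weights stand in for learned parameters.
W_enc = rng.normal(size=(N_CONTENT, N_FEATS))
speaker_table = rng.normal(size=(N_SPEAKERS, N_SPK_EMB))
W_dec = rng.normal(size=(N_FEATS, N_CONTENT + N_SPK_EMB))


def encode(frames):
    """Map frames of shape (T, N_FEATS) to content codes (T, N_CONTENT)."""
    return np.tanh(frames @ W_enc.T)


def decode(content, speaker_id):
    """Rebuild frames from content codes plus one speaker embedding."""
    n_frames = content.shape[0]
    spk = np.broadcast_to(speaker_table[speaker_id], (n_frames, N_SPK_EMB))
    latent = np.concatenate([content, spk], axis=1)
    return latent @ W_dec.T


def convert(frames, target_speaker):
    """Keep the content code, swap in the target speaker's code."""
    return decode(encode(frames), target_speaker)


# Ten random frames stand in for an utterance by speaker 0.
source = rng.normal(size=(10, N_FEATS))
reconstructed = decode(encode(source), speaker_id=0)  # training-style pass
converted = convert(source, target_speaker=2)         # conversion pass
```

In this sketch the content code is identical in both passes, so the two outputs differ only through the speaker embedding. A real system would train the encoder and decoder on a reconstruction loss and would need additional measures to keep speaker information out of the content code.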
Paul Arthur Adrien Pierre Dreyfus