Speech Emotion Recognition (SER) has benefited from many of the recent advances in deep learning, including recurrent and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of possibly conflicting annotations in the training data as soft targets can outperform majority voting, and show that the performance gain grows with the agreement level of the annotators.
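The contrast between majority voting and soft targets can be illustrated with a minimal sketch (not the authors' code; the label set and helper names are hypothetical): majority voting collapses the annotators' votes into one hard label, while soft targets keep the empirical distribution over labels.

```python
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set

def majority_vote(annotations):
    """Hard label: the single most frequent annotation."""
    return Counter(annotations).most_common(1)[0][0]

def soft_targets(annotations):
    """Soft label: the empirical distribution of annotations over EMOTIONS."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts.get(e, 0) / total for e in EMOTIONS]

# Three annotators disagree on one utterance:
votes = ["happy", "happy", "neutral"]
print(majority_vote(votes))   # the "neutral" vote is discarded entirely
print(soft_targets(votes))    # the disagreement is preserved as a distribution
```

Training against the soft distribution (e.g. with a cross-entropy loss over these targets) lets the model see how confident the annotators were, rather than treating every utterance as an unambiguous example.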
Seyedeh Sahar Sadrizadeh, Hatef Otroshi Shahreza