In this thesis, we present a transformer-based multilingual embedding model that represents sentences from different languages in a common space. Our system uses a simplified transformer architecture with a byte-pair encoding vocabulary shared between two languages (English and French), trained on publicly available parallel corpora. We also experiment with new training objectives, including a cross-lingual loss and a sentence alignment loss, to improve representation quality. We evaluate the resulting sentence representations on the MUSE sentence retrieval task, multilingual zero-shot document classification on MLDoc (Schwenk and Li, 2018), and natural language inference on XNLI, comparing against Bi-Bert2Vec (Sabet et al., 2020), LASER (Artetxe and Schwenk, 2019), and multilingual BERT (mBERT; Devlin et al., 2018). Our proposed model obtains state-of-the-art results on cross-lingual sentence retrieval and also outperforms Bi-Bert2Vec and LASER on the MLDoc task. Finally, we experiment with model architectures, objectives, and the tensors used to represent sentences, and propose a new sentence alignment loss that has a positive impact on the quality of the sentence representations.
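The abstract does not spell out the form of the sentence alignment loss. A minimal sketch of one plausible variant, assuming mean-pooled token embeddings and a squared-distance penalty that pulls parallel English/French sentence vectors together (all function names, shapes, and data here are illustrative, not the thesis's actual implementation):

```python
import numpy as np

def mean_pool(token_embeddings):
    # Average the token vectors into a single sentence vector.
    return token_embeddings.mean(axis=0)

def alignment_loss(src_sents, tgt_sents):
    # Mean squared Euclidean distance between paired sentence
    # embeddings; minimizing it pulls translations together
    # in the shared embedding space.
    diffs = [mean_pool(s) - mean_pool(t)
             for s, t in zip(src_sents, tgt_sents)]
    return float(np.mean([np.sum(d * d) for d in diffs]))

# Toy example: two "parallel" sentence pairs with 4-dim token embeddings.
rng = np.random.default_rng(0)
en = [rng.normal(size=(5, 4)), rng.normal(size=(3, 4))]
fr = [e + 0.01 for e in en]  # near-identical pairs -> small loss
print(alignment_loss(en, fr))
```

In practice such a term would be added to the main training objective (e.g. a translation or masked-language-modeling loss) with a weighting coefficient, so that the encoder is rewarded for mapping a sentence and its translation to nearby points.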
Vinitra Swamy, Jibril Albachir Frej, Paola Mejia Domenzain, Luca Zunino, Tommaso Martorella, Elena Grazia Gado