In this thesis, we present a transformer-based multi-lingual embedding model that represents sentences in different languages in a common space. Our system uses a simplified transformer architecture with a byte-pair encoding vocabulary shared between two languages (English and French) and is trained on publicly available parallel corpora. We also experiment with new objective losses, including a cross-lingual loss and a sentence alignment loss, to improve representation quality. We evaluate the generated sentence representations on the sentence retrieval task from MUSE, on multi-lingual zero-shot document classification from MLDoc, and on natural language inference from XNLI, and compare them with competitors such as Bi-Bert2Vec (Sabet et al., 2020), LASER (Artetxe and Schwenk, 2019) and Multi-lingual BERT (mBERT; Devlin et al., 2018). Our proposed model obtains state-of-the-art results on the cross-lingual sentence retrieval task, and it also outperforms Bi-Bert2Vec and LASER on the MLDoc task (Schwenk and Li, 2018). Finally, we experiment with model architectures, objectives and the tensors used to represent sentences, and find that the proposed sentence alignment loss has a positive impact on the quality of the sentence representations.
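To make the cross-lingual sentence retrieval evaluation concrete, the following is a minimal sketch of how precision-at-1 could be computed over aligned sentence pairs. It assumes retrieval by cosine nearest neighbour and that source sentence `i` is aligned with target sentence `i`; the exact protocol used in the thesis is not stated in the abstract. The function names and the toy two-dimensional vectors are hypothetical and stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Precision@1: share of source sentences whose most similar target
    sentence (by cosine) is the aligned one at the same index.
    Assumption: src_vecs[i] and tgt_vecs[i] are translations of each other."""
    if len(src_vecs) != len(tgt_vecs):
        raise ValueError("source and target lists must have the same length")
    if not src_vecs:
        return 0.0
    hits = 0
    for i, src in enumerate(src_vecs):
        # Index of the target embedding closest to this source embedding.
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        if best == i:
            hits += 1
    return hits / len(src_vecs)


# Toy embeddings (hypothetical values, not model output).
en = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
fr = [[0.9, 0.0], [0.1, 0.8], [-1.0, 0.2]]

print(retrieval_accuracy(en, fr))
```

In this toy setup the first two English vectors retrieve their aligned French vectors, while the third does not, so a well-aligned shared space would be expected to score higher than this example.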