Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Karl Aberer, Rémi Philippe Lebret, Negar Foroutan Eghlidi
2023
Conference paper

Abstract

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We use a cross-lingual alignment of contextualized token embeddings to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target-language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.
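
The token-level alignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the training objective, so everything below is an assumption made for illustration: a mean squared error between embeddings of aligned tokens, a hypothetical function name (alignment_loss), and toy vectors standing in for the outputs of an English "teacher" text encoder and a multilingual "student" encoder. The aligned index pairs are assumed to come from a word aligner run on an English sentence and its machine translation.

```python
from typing import List, Tuple

Vector = List[float]


def alignment_loss(
    teacher_tokens: List[Vector],
    student_tokens: List[Vector],
    aligned_pairs: List[Tuple[int, int]],
) -> float:
    """Mean squared error over aligned (teacher, student) token embeddings.

    Assumption: each pair (i, j) says that English token i corresponds to
    target-language token j. The loss choice (MSE) is illustrative only.
    """
    if not aligned_pairs:
        return 0.0
    total = 0.0
    for i, j in aligned_pairs:
        t, s = teacher_tokens[i], student_tokens[j]
        # Per-pair squared error, averaged over the embedding dimension.
        total += sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
    # Average over all aligned token pairs.
    return total / len(aligned_pairs)


# Toy data: two tokens per sentence, two-dimensional embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0]]   # English tokens (frozen encoder)
student = [[0.0, 1.0], [1.0, 0.0]]   # translated tokens (trainable encoder)
pairs = [(0, 1), (1, 0)]             # word order is swapped in translation

loss = alignment_loss(teacher, student, pairs)
print(loss)
```

In a real training setup the student embeddings would be updated by gradient descent to reduce this loss while the teacher stays fixed, which is why no image input is needed at this stage.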

Official source

https://infoscience.epfl.ch/record/310756?ln=fr
