Multi-task prompt-RSVQA to explicitly count objects on aerial images

Introduced to enable a wider use of Earth Observation images through natural language, Remote Sensing Visual Question Answering (RSVQA) remains a challenging task, in particular for questions related to counting. To address this specific challenge, we propose a modular Multi-task prompt-RSVQA model based on object detection and question answering modules. By creating a semantic bottleneck that describes the image and by providing a visual answer, our model allows users to assess the visual grounding of the answer and better interpret the prediction. A set of ablation studies is designed to assess the contribution of each module, and evaluation metrics are discussed for a finer-grained assessment. Experiments demonstrate competitive results against literature baselines and a zero-shot VQA model. In particular, for numerical Counting questions, our proposed model predicts answers that are consistently closer to the ground truth.
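To make the idea of a semantic bottleneck more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes an object detector has already returned one class label per detected object, and it turns those labels into a textual description that a question answering module could receive as a prompt. The function names `build_prompt` and `counting_error`, the prompt wording, and the use of absolute difference as the distance to the ground truth are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def build_prompt(detected_labels, question):
    """Turn detector output into a textual scene description plus the question.

    Hypothetical prompt format: the paper's actual format is not given here.
    """
    counts = Counter(detected_labels)
    if counts:
        # Sort by class name so the prompt is deterministic.
        description = ", ".join(
            f"{n} {label}" for label, n in sorted(counts.items())
        )
    else:
        description = "no objects detected"
    return f"Objects in image: {description}. Question: {question}"


def counting_error(predicted, ground_truth):
    """Absolute distance between a predicted count and the true count
    (an assumed stand-in for a distance-based counting metric)."""
    return abs(predicted - ground_truth)


# Example detector output (made-up labels for illustration only).
labels = ["car", "building", "car", "car"]
prompt = build_prompt(labels, "How many cars are there?")
print(prompt)
```

Because the description is plain text, a user can read it and check whether the objects the answer relies on were actually detected, which is the kind of interpretability the abstract attributes to the semantic bottleneck.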

