Now is an exciting time for the domain of Earth observation (EO), with a multitude of diverse sensors looking at the planet from satellites, airplanes, or drones. The volume of imagery acquired is massive and holds great potential for a variety of applications. However, the ability to extract useful insights from this imagery, and thus realize the full potential of EO, is limited by a technical barrier: the skills necessary to retrieve specific information from this unique resource. While a large proportion of the population has become familiar with optical, very high resolution images, the use of data-driven pipelines to efficiently retrieve content of interest from data of various spatial and spectral resolutions is technical and task-specific. These limitations create a gap between available EO data and potential, non-specialist end-users. To tackle this challenge, the task of remote sensing visual question answering (RSVQA) proposes to use natural language to enable interactions, through questions and answers, between EO data and end-users.
The goals of this thesis are to improve the understanding of RSVQA systems by investigating their different parts, and to propose transparent and innovative methodologies and evaluation strategies. The emphasis is on transparency: on the one hand, designing architectures that enhance the interpretability of the answer predictions by providing supporting insights; on the other hand, formulating evaluation metrics that better capture the performance and robustness of the systems.
The first part of this thesis is dedicated to analytical studies. Different strategies to combine representations of the images and the questions are compared in terms of both performance and efficiency. Next, the language model encoder used to produce the question representation is considered, contrasting the previously standard recurrent neural network with the modern attention-based transformer. The benefit of fine-tuning the pre-trained encoders is also examined. Since fine-tuning can affect the robustness of an RSVQA model as it learns the language biases present in the dataset, the pitfall of language biases in RSVQA is thoroughly studied in order to propose evaluation metrics for both datasets and models.
The second part focuses on methodological development. The prompt-RSVQA architecture describes the image in text, which is then provided as context, along with the question, to a language model. The availability of this additional semantic information allows both modalities to be evaluated separately. Building on this, the multi-task prompt-RSVQA model focuses on explicitly detecting objects in the visual inputs to improve the predictions for numerical questions and to directly visualize their answers in the image. In PAN-RSVQA, the variety of perspectives used to describe images is further enhanced, and the semantic bottleneck imposed in the previous propositions is enriched with detailed vector representations of the image.