Visual Question Answering (VQA) on remote sensing imagery can help non-expert users extract information from Earth observation data. Current approaches follow a neural encoder-decoder design, combining convolutional and recurrent encoders with cross-modal fusion components. In other VQA application domains, however, state-of-the-art methods rely on self-attention, employing multi-modal encoders based on the Transformer architecture. In this work, we assess the degree to which a model based on self-attention can improve over previous methods for remote sensing VQA. Specifically, we present results with an extended version of MM-BERT, a model originally proposed for medical VQA that requires neither the extraction of region features from the images nor pre-training on extensive amounts of data. Experiments show that the proposed method can improve results over previous approaches. Even without in-domain pre-training or specific adaptations to the remote sensing domain, and using low-resolution versions of the images as input, we achieve high accuracy on three different datasets used extensively in previous studies.