Visual Question Answering (VQA) on remote sensing imagery can help non-expert users extract information from Earth observation data. Current approaches follow a neural encoder-decoder design, combining convolutional and recurrent encoders with cross-modal fusion components. However, in other VQA application domains, the current state-of-the-art methods rely on self-attention, employing multi-modal encoders based on the Transformer architecture. In this work, we assess the degree to which a model based on self-attention can improve over previous methods for remote sensing VQA. We specifically present results with an extended version of a previous model named MMBERT, originally proposed for medical VQA, which requires neither the extraction of region features from the images nor model pre-training with extensive amounts of data. Experiments show that the proposed method can improve results over previous approaches. Even without in-domain pre-training or specific adaptations to the remote sensing domain, and using low-resolution versions of the images as input, we achieve high accuracy on three different datasets extensively used in previous studies.
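To make the architectural idea concrete, the sketch below shows how a Transformer-based multi-modal encoder of the kind described above can fuse question tokens and image grid features through joint self-attention, without region proposals. This is a minimal illustrative implementation in PyTorch, not the authors' MMBERT variant: the class name, dimensions, and hyper-parameters are all assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class MultiModalVQAEncoder(nn.Module):
    """Minimal multi-modal Transformer encoder for VQA (illustrative sketch)."""

    def __init__(self, vocab_size=30000, num_answers=100, d_model=256,
                 nhead=8, num_layers=4, img_feat_dim=2048, max_len=64):
        super().__init__()
        # Text side: token embeddings plus learned positional embeddings.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Image side: project CNN grid features (e.g., a 7x7 ResNet feature
        # map flattened to 49 vectors) into the shared d_model space; no
        # object-detection region features are needed.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        # Learned [CLS]-style token whose final state summarizes both modalities.
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # VQA framed as classification over a fixed answer vocabulary.
        self.head = nn.Linear(d_model, num_answers)

    def forward(self, question_ids, img_grid):
        # question_ids: (B, L) token ids; img_grid: (B, N, img_feat_dim).
        B, L = question_ids.shape
        pos = torch.arange(L, device=question_ids.device)
        txt = self.token_emb(question_ids) + self.pos_emb(pos)
        img = self.img_proj(img_grid)
        cls = self.cls.expand(B, -1, -1)
        # Self-attention runs jointly over [CLS] + question tokens + image
        # patches, so every layer performs cross-modal fusion.
        fused = self.encoder(torch.cat([cls, txt, img], dim=1))
        return self.head(fused[:, 0])  # logits over candidate answers


# Toy usage: batch of 2 questions with 7x7 grid features (49 vectors each).
model = MultiModalVQAEncoder()
logits = model(torch.randint(0, 30000, (2, 12)), torch.randn(2, 49, 2048))
print(logits.shape)  # torch.Size([2, 100])
```

Concatenating the two token sequences and letting self-attention attend across them is what distinguishes this single-stream design from encoder-decoder models with a separate fusion component: fusion happens inside every Transformer layer rather than in one dedicated module.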