Remote sensing visual question answering with a self-attention multi-modal encoder

Visual Question Answering (VQA) on remote sensing imagery can help non-expert users extract information from Earth observation data. Current approaches follow a neural encoder-decoder design, combining convolutional and recurrent encoders with cross-modal fusion components. However, in other VQA application domains, state-of-the-art methods rely on self-attention, employing multi-modal encoders based on the Transformer architecture. In this work, we assess the degree to which a model based on self-attention can improve over previous methods for remote sensing VQA. We specifically present results with an extended version of a previous model named MM-BERT, which was originally proposed for medical VQA and requires neither the extraction of region features from the images nor model pre-training with extensive amounts of data. Experiments show that the proposed method can improve results over previous approaches. Even without in-domain pre-training or specific adaptations to the remote sensing domain, and using low-resolution versions of the images as input, we achieve high accuracy on three different datasets extensively used in previous studies.
