Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
Remote sensing visual question answering (RSVQA) opens new avenues to promote the use of satellite data by interfacing satellite image analysis with natural language processing. Capitalizing on the remarkable advances in natural language processing and computer vision, RSVQA aims to answer a question formulated by a human user about a remote sensing image. This is achieved by extracting representations from the image and the question, then fusing them into a joint representation. Focusing on the language part of the architecture, this study compares and evaluates the suitability for the RSVQA task of two language models: a traditional recurrent neural network (Skip-thoughts) and a recent attention-based Transformer (BERT). We study whether large Transformer models are beneficial to the task and whether fine-tuning is needed for these models to perform at their best. Our findings show that performance benefits from fine-tuning the language models, and that RSVQA with BERT is slightly but consistently better when properly fine-tuned.
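The pipeline described in the abstract (extract an image representation and a question representation, then fuse them into a joint representation that a classifier maps to an answer) can be sketched as follows. All dimensions, the element-wise-product fusion, and every variable name here are illustrative assumptions for exposition, not the paper's actual configuration; the real system would use a vision backbone and a language model (Skip-thoughts or BERT) in place of the random placeholder features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (assumptions, not the paper's settings).
d_img, d_q, d_joint, n_answers = 2048, 768, 512, 10

# Stand-ins for the features a vision backbone and a language model
# (e.g. BERT's sentence embedding) would produce for one image/question pair.
img_feat = rng.standard_normal(d_img)
q_feat = rng.standard_normal(d_q)

# Project both modalities into a shared space, then fuse by element-wise
# product -- one common choice of joint representation in VQA systems.
W_img = rng.standard_normal((d_joint, d_img)) * 0.01
W_q = rng.standard_normal((d_joint, d_q)) * 0.01
joint = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)

# A linear head scores a fixed set of candidate answers; the answer with
# the highest score is the model's prediction (answering is framed as
# classification over frequent answers).
W_ans = rng.standard_normal((n_answers, d_joint)) * 0.01
answer_scores = W_ans @ joint
predicted_answer = int(np.argmax(answer_scores))
```

In a trained system, the projection and answer matrices would be learned end to end; fine-tuning the language model (as the study investigates) means also updating its weights during this training rather than freezing them.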
Volkan Cevher, Grigorios Chrysos, Fanghui Liu, Yongtao Wu, Elias Abad Rocamora