This lecture explores applications of transformers in visual intelligence, focusing on object detection, panoptic segmentation, and high-resolution image synthesis. It covers framing object detection as a set prediction problem with a set-based loss, the role of decoder output queries and the communication between them, and visualizations of encoder attention maps that reveal global image understanding. The lecture also discusses training Vision Transformers in a self-supervised manner, the success of generative pre-training for vision tasks, and BERT-style pre-training of image transformers. It concludes with generating images from sparse representations using a DC-Transformer and with fusing features from multiple cameras into a shared representation.
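The set-based loss mentioned above pairs predicted queries with ground-truth objects via bipartite matching before computing classification and box losses. Below is a minimal, hedged sketch of that idea; the cost terms, weights, and function names are illustrative assumptions rather than the lecture's exact formulation.

```python
# Sketch of a DETR-style set prediction loss: predictions and ground-truth
# objects are matched one-to-one with the Hungarian algorithm, then losses
# are computed on the matched pairs. Costs and weights here are illustrative.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def set_prediction_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, no_object_class):
    """pred_logits: (N, C), pred_boxes: (N, 4); gt_labels: (M,), gt_boxes: (M, 4), M <= N."""
    probs = pred_logits.softmax(-1)                       # (N, C)
    # Matching cost: negative probability of the true class plus L1 box distance.
    cost_class = -probs[:, gt_labels]                     # (N, M)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)     # (N, M)
    cost = (cost_class + cost_box).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)        # one-to-one assignment
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.long)

    # Unmatched predictions are supervised toward the "no object" class.
    target_classes = torch.full((pred_logits.shape[0],), no_object_class, dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]
    loss_class = F.cross_entropy(pred_logits, target_classes)
    loss_box = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return loss_class + loss_box


# Toy usage: 5 predicted queries, 2 ground-truth objects, 3 classes + "no object".
logits = torch.randn(5, 4)
boxes = torch.rand(5, 4)
gt_labels = torch.tensor([0, 2])
gt_boxes = torch.rand(2, 4)
print(set_prediction_loss(logits, boxes, gt_labels, gt_boxes, no_object_class=3))
```

Because the matching is one-to-one, each ground-truth object supervises exactly one query and the remaining queries learn to predict "no object", which removes the need for hand-designed anchors or non-maximum suppression.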
This video is available exclusively on MediaSpace for a restricted audience. If you have the necessary permissions, please log in to MediaSpace to access it.