Publication

Fast and Future: Towards Efficient Forecasting in Video Semantic Segmentation

Abstract

Deep learning has revolutionized the field of computer vision, a success largely attributable to the growing size of models, datasets, and computational power.Simultaneously, a critical pain point arises as several computer vision applications are deployed on low-power embedded devices, necessitating real-time processing capabilities.This challenge intensifies for semantic segmentation, a dense prediction task demanding substantial memory and computational resources.This thesis explores techniques to streamline real-time segmentation networks, enhance their efficiency, and deal with potential ambiguity.First, we introduce a latency-aware segmentation metric, a measure that combines the mean Intersection over Union with the network processing time, providing a practical metric for applied settings.Emphasis is placed on the concept of "anticipation" in real-time networks - these systems should be capable of predicting future input segmentation.Consequently, we then design an anticipatory convolutional network incorporating an inventive convolution layer. This novel layer reduces computation by reusing features from previous video frame computations, exploiting their temporal coherence.Next, we present a method to accelerate transformer-based segmentation networks called `patch-pausing'. This technique halts the processing of image patches deemed to be already correctly segmented by assessing the network's confidence in its prediction.Remarkably, our experimental results indicate that more than half of the patches can be paused early in the process, with a minimal impact on segmentation accuracy.This study concludes with the introduction of a discrete diffusion model for segmentation. This model allows for the sampling of multiple potential segmentations for a given input while accurately following the training data distribution.Combining this diffusion model within an autoregressive scheme, we successfully showcase its capacity to generate long-term future predictions of segmentation.The implementation and evaluation of these approaches contribute to the ongoing efforts to improve real-time segmentation networks and facilitate more efficient deployment of computer vision applications on low-power devices.

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related concepts (32)
Image segmentation
In and computer vision, image segmentation is the process of partitioning a into multiple image segments, also known as image regions or image objects (sets of pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Object co-segmentation
In computer vision, object co-segmentation is a special case of , which is defined as jointly segmenting semantically similar objects in multiple images or video frames. It is often challenging to extract segmentation masks of a target/object from a noisy collection of images or video frames, which involves object discovery coupled with . A noisy collection implies that the object/target is present sporadically in a set of images or the object/target disappears intermittently throughout the video of interest.
Computer vision
Computer vision tasks include methods for , , and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images (the input to the retina in the human analog) into descriptions of the world that make sense to thought processes and can elicit appropriate action.
Show more
Related publications (96)

Aggregating Spatial and Photometric Context for Photometric Stereo

David Honzátko

Photometric stereo, a computer vision technique for estimating the 3D shape of objects through images captured under varying illumination conditions, has been a topic of research for nearly four decades. In its general formulation, photometric stereo is an ...
EPFL2024

Enabling Uncertainty Estimation in Iterative Neural Networks

Pascal Fua, Nikita Durasov, Doruk Oner, Minh Hieu Lê

Turning pass-through network architectures into iterative ones, which use their own output as input, is a well-known approach for boosting performance. In this paper, we argue that such architectures offer an additional benefit: The convergence rate of the ...
2024

Modular segmentation, spatial analysis and visualization of volume electron microscopy datasets

Martin Weigert

Volume electron microscopy is the method of choice for the in situ interrogation of cellular ultrastructure at the nanometer scale, and with the increase in large raw image datasets generated, improving computational strategies for image segmentation and s ...
Berlin2024
Show more
Related MOOCs (22)
Digital Signal Processing [retired]
The course provides a comprehensive overview of digital signal processing theory, covering discrete time, Fourier analysis, filter design, sampling, interpolation and quantization; it also includes a
Digital Signal Processing
Digital Signal Processing is the branch of engineering that, in the space of just a few decades, has enabled unprecedented levels of interpersonal communication and of on-demand entertainment. By rewo
Digital Signal Processing I
Basic signal processing concepts, Fourier analysis and filters. This module can be used as a starting point or a basic refresher in elementary DSP
Show more

Graph Chatbot

Chat with Graph Search

Ask any question about EPFL courses, lectures, exercises, research, news, etc. or try the example questions below.

DISCLAIMER: The Graph Chatbot is not programmed to provide explicit or categorical answers to your questions. Rather, it transforms your questions into API requests that are distributed across the various IT services officially administered by EPFL. Its purpose is solely to collect and recommend relevant references to content that you can explore to help you answer your questions.