Publication

Unsupervised Visual Entity Abstraction towards 2D and 3D Compositional Models

Beril Besbinar
2022
EPFL thesis
Abstract

Object-centric learning has gained significant attention over the last years as it can serve as a powerful tool to analyze complex scenes as a composition of simpler entities. Well-established tasks in computer vision, such as object detection or instance segmentation, are generally posed in supervised settings. The recent surge of fully-unsupervised approaches for entity abstraction, which often tackle the problem with generative modeling or self-supervised learning, indicates the rising interest in structured representations in the form of objects or object parts. Indeed, these can provide benefits to many challenging tasks in visual analysis, reasoning, forecasting, and planning, and provide a path for combinatorial generalization. In this thesis, we exploit different consistency constraints for disambiguating entities in fully-unsupervised settings. We first consider videos and infer entities that can be modeled by consistent motion between frames at different time steps. We unconventionally opt for representing objects with amodal masks and investigate methods to accumulate information about each entity throughout time for an occlusion-aware decomposition. Approximating motion with parametric spatial transformations enables us to impose cyclic long-term consistency that contributes to reasoning about unseen parts of entities. We then develop a video prediction model based on this decomposition scheme. As the proposed decomposition decouples motion from entity appearance, we attribute the inherent stochasticity of the video prediction problem to our parametric motion model and propose a three-stage training scheme for more plausible prediction outcomes. After deterministic decomposition at the first stage, we train our new model for short-term prediction in stochastic settings. Long-term prediction as the last step helps us learn the distribution of motion present in the dataset for each entity. 
Finally, we focus on multi-view image settings and consider two different arrangements, in both of which the scene is observed from different viewpoints. We attempt to find correspondences between the volumetric representations of those observations that are guided by differentiable rendering algorithms. By grouping the volume units based on consistent matching of features, we partition the volumetric representation, which allows each inferred entity to be rendered individually. We present promising outcomes for all of the proposed unsupervised object-representation schemes on synthetic datasets and outline ideas for scaling them up to real-world data as future work.
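
The cyclic consistency idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that per-entity motion is modeled as a 2D similarity transform (rotation, uniform scale, translation); the abstract only says "parametric spatial transformations", so this parameterization and every function name below are hypothetical, not the thesis's implementation. The sketch chains frame-to-frame transforms forward, applies a long-range backward transform, and measures how far the resulting loop is from the identity.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def similarity(theta, tx, ty, scale=1.0):
    """3x3 homogeneous matrix of a similarity transform (assumed motion model)."""
    c = scale * math.cos(theta)
    s = scale * math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(steps):
    """Chain frame-to-frame transforms; steps[0] is applied first."""
    total = IDENTITY
    for m in steps:
        total = matmul(m, total)
    return total

def invert(m):
    """Closed-form inverse of a matrix built by `similarity`."""
    a, b, tx, ty = m[0][0], m[1][0], m[0][2], m[1][2]
    det = a * a + b * b
    ia, ib = a / det, -b / det  # inverse linear part is [[ia, -ib], [ib, ia]]
    return [[ia, -ib, -(ia * tx - ib * ty)],
            [ib, ia, -(ib * tx + ia * ty)],
            [0.0, 0.0, 1.0]]

def cycle_error(forward_steps, backward):
    """Frobenius distance between the forward-then-backward loop and identity."""
    loop = matmul(backward, compose(forward_steps))
    return math.sqrt(sum((loop[i][j] - IDENTITY[i][j]) ** 2
                         for i in range(3) for j in range(3)))

# Two forward steps (t -> t+1 -> t+2) with made-up parameters.
forward = [similarity(0.10, 1.0, 0.0), similarity(-0.05, 0.5, 2.0, scale=1.1)]

# A backward transform that exactly undoes the chain closes the cycle.
err_ok = cycle_error(forward, invert(compose(forward)))

# A backward transform that ignores the motion leaves a residual.
err_bad = cycle_error(forward, IDENTITY)
```

In a learning setup, an error of this kind would typically be used as a penalty on predicted transform parameters; how the thesis formulates its constraint is not stated on this page.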

Related concepts (36)
Feature learning
In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task. Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process.
Image segmentation
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects (sets of pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Machine learning
Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines 'discover' their 'own' algorithms, without needing to be explicitly told what to do by any human-developed algorithms. Recently, generative artificial neural networks have been able to surpass results of many previous approaches.
Related publications (136)

Advancing Self-Supervised Deep Learning for 3D Scene Understanding

Seyed Mohammad Mahdi Johari

Recent advancements in deep learning have revolutionized 3D computer vision, enabling the extraction of intricate 3D information from 2D images and video sequences. This thesis explores the application of deep learning in three crucial challenges of 3D com ...
EPFL, 2024

Aggregating Spatial and Photometric Context for Photometric Stereo

David Honzátko

Photometric stereo, a computer vision technique for estimating the 3D shape of objects through images captured under varying illumination conditions, has been a topic of research for nearly four decades. In its general formulation, photometric stereo is an ...
EPFL, 2024

Fast and Future: Towards Efficient Forecasting in Video Semantic Segmentation

Evann Pierre Guy Courdier

Deep learning has revolutionized the field of computer vision, a success largely attributable to the growing size of models, datasets, and computational power. Simultaneously, a critical pain point arises as several computer vision applications are deployed ...
EPFL, 2024