Publication

How crowding challenges (feedforward) convolutional neural networks

Abstract

Are (feedforward) convolutional neural networks (CNNs) good models for the human visual system? Here, we used visual crowding as a well-controlled psychophysical test to probe CNNs. Visual crowding is a ubiquitous breakdown of object recognition in the human visual system, whereby targets become jumbled and unrecognisable in the presence of flanking objects. Humans exhibit several well-documented effects of crowding, such as invariance to size, where the size of the target and flanker letters may be changed without impacting the strength of crowding. We show that feedforward CNNs are unable to reproduce invariance to size, confusion between target and flanker identities, and, importantly, uncrowding, where paradoxically increasing the number of flankers improves performance. We investigate uncrowding using a recurrent, neurally inspired model called LAMINART, which we find can reproduce uncrowding as observed in humans. Furthermore, we show that capsule networks, a recurrent family of CNNs with grouping and segmentation mechanisms, outperform all other models of uncrowding to date, demonstrating the importance of grouping and segmentation mechanisms in visual information processing in general.
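
For readers unfamiliar with the paradigm, the sketch below illustrates the kind of display used in (un)crowding experiments: a vernier target that is easy to recognise alone, harder when surrounded by a single square flanker, and "uncrowded" configurations with additional flanking squares. This is only an illustrative Python sketch; the shapes, sizes, and spacings are assumptions chosen for clarity, not the stimuli used in the publication.

# Illustrative (un)crowding stimuli: a vernier target with a configurable
# number of square flankers, rendered into a NumPy image. All geometry here
# is an assumption for demonstration, not taken from the paper.
import numpy as np


def draw_vernier(img, cx, cy, bar_len=8, offset=2):
    """Draw a vernier target: upper bar offset left, lower bar offset right."""
    img[cy - bar_len:cy, cx - offset] = 1.0   # upper segment
    img[cy:cy + bar_len, cx + offset] = 1.0   # lower segment


def draw_square(img, cx, cy, half=12):
    """Draw the outline of a square flanker centred on (cx, cy)."""
    img[cy - half:cy + half + 1, cx - half] = 1.0   # left edge
    img[cy - half:cy + half + 1, cx + half] = 1.0   # right edge
    img[cy - half, cx - half:cx + half + 1] = 1.0   # top edge
    img[cy + half, cx - half:cx + half + 1] = 1.0   # bottom edge


def crowding_stimulus(n_flankers=1, size=256, spacing=28):
    """Vernier at the centre plus `n_flankers` square flankers per side.

    n_flankers = 0: vernier alone (recognised easily),
    n_flankers = 1: single surrounding square (classic crowding),
    n_flankers > 1: rows of squares (uncrowding configurations, where human
    performance paradoxically recovers).
    """
    img = np.zeros((size, size), dtype=np.float32)
    cx = cy = size // 2
    draw_vernier(img, cx, cy)
    if n_flankers >= 1:
        draw_square(img, cx, cy)                    # square around the target
    for k in range(1, n_flankers):
        draw_square(img, cx - k * spacing, cy)      # flankers to the left
        draw_square(img, cx + k * spacing, cy)      # flankers to the right
    return img


if __name__ == "__main__":
    for n in (0, 1, 5):
        stim = crowding_stimulus(n_flankers=n)
        print(f"{n} flanker square(s): {int(stim.sum())} stimulus pixels")

Feeding such images with varying numbers of flankers to a classifier trained to report the vernier offset is one simple way to probe whether a model shows crowding and uncrowding.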

Related concepts (24)
Convolutional neural network
A convolutional neural network (CNN) is a regularized type of feedforward neural network that learns features on its own by optimizing its filters (or kernels). Regularized weights over fewer connections prevent the vanishing and exploding gradients seen during backpropagation in earlier neural networks. For example, a fully connected layer processing an image sized 100 × 100 pixels would require 10,000 weights for each neuron, whereas a convolutional layer shares a small set of filter weights across all positions in the image (see the short worked example after the related concepts below).
Visual system
The visual system comprises the sensory organ (the eye) and parts of the central nervous system (the retina containing photoreceptor cells, the optic nerve, the optic tract and the visual cortex). It gives organisms the sense of sight (the ability to detect and process visible light) and also enables several non-image-forming photoresponses. It detects and interprets information from the optical spectrum perceptible to that species to "build a representation" of the surrounding environment.
Visual perception
Visual perception is the ability to interpret the surrounding environment through photopic vision (daytime vision), color vision, scotopic vision (night vision), and mesopic vision (twilight vision), using light in the visible spectrum reflected by objects in the environment. This is different from visual acuity, which refers to how clearly a person sees (for example "20/20 vision"). A person can have problems with visual perceptual processing even if they have 20/20 vision.
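
As a rough, back-of-the-envelope illustration of the parameter count mentioned in the convolutional neural network entry above (the 5 × 5 kernel size is an assumption for illustration, not taken from the text):

# Illustrative parameter-count comparison (assumed 5 x 5 kernel, biases ignored).
height, width = 100, 100

# Fully connected: every neuron needs one weight per input pixel.
fc_weights_per_neuron = height * width
print("fully connected, weights per neuron:", fc_weights_per_neuron)   # 10000

# Convolutional: a single small filter is shared across all image positions,
# so its weight count is independent of the image size.
kernel = 5
conv_weights_per_filter = kernel * kernel
print("convolutional, weights per filter:", conv_weights_per_filter)   # 25

This is the usual argument for weight sharing: the fully connected count grows with the image area, while the convolutional count depends only on the kernel size.
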
Related publications (76)

Predicting Visual Stimuli From Cortical Response Recorded With Wide-Field Imaging in a Mouse

Silvestro Micera, Daniela De Luca

Neural decoding of the visual system is a subject of research interest, both to understand how the visual system works and to apply this knowledge in areas such as computer vision or brain-computer interfaces. Spike-based decoding is often used, ...
IEEE-Inst Electrical Electronics Engineers Inc, 2024

Alpha peak frequency affects visual performance beyond temporal resolution

Michael Herzog, David Pascucci, Maëlan Quentin Menétrey, Maya Roinishvili

Recent work suggests that the individual alpha peak frequency (IAPF) reflects the temporal resolution of visual processing: individuals with higher IAPF can segregate stimuli at shorter intervals compared to those with lower IAPF. However, this evidence ma ...
2024

SVGC-AVA: 360-Degree Video Saliency Prediction With Spherical Vector-Based Graph Convolution and Audio-Visual Attention

Pascal Frossard, Chenglin Li, Li Wei, Qin Yang, Yuelei Li, Hao Wang

Viewers of 360-degree videos are provided with both a visual modality that characterizes their surrounding views and an audio modality that indicates the sound direction. Though both modalities are important for saliency prediction, little work has been done by joint ...
IEEE-Inst Electrical Electronics Engineers Inc, 2024
Related MOOCs (7)
Neuronal Dynamics 2 - Computational Neuroscience: Neuronal Dynamics of Cognition
This course explains the mathematical and computational models that are used in the field of theoretical neuroscience to analyze the collective dynamics of thousands of interacting neurons.
Selected chapters from the winter school on the multi-scale brain
Understanding the brain requires an integrated understanding of its different scales of organisation. This Massive Open Online Course (MOOC) will take you through the latest data, models ...