Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.
State of the art content-based image retrieval algorithms owe their excellent performance to the rich semantics encoded in the deep activations of a convolutional neural network. The difference between these algorithms lies mostly in how activations are combined into a compact global image descriptor. In this paper, we propose to use deep feature factorization to achieve this goal. By factorizing CNN activations, we decompose an input image into semantic regions, represented by both spatial saliency heatmaps and basis vectors serving as descriptors for those regions. When combined to form a global image descriptor, our experiments show that DFF surpasses the state of the art in both image retrieval and localization of the region of interest within the set of retrieved images.
Loading
Loading
Loading
Loading
Loading
black-boxes''. The Law of Parsimony states that
simpler solutions are more likely to be correct than complex ones''. Since they perform quite well in practice, a natural question to ask, then, is in what way are neural networks simple?
We propose that compression is the answer. Since good generalization requires invariance to irrelevant variations in the input, it is necessary for a network to discard this irrelevant information. As a result, semantically similar samples are mapped to similar representations in neural network deep feature space, where they form simple, low-dimensional structures.
Conversely, a network that overfits relies on memorizing individual samples. Such a network cannot discard information as easily.
In this thesis we characterize the difference between such networks using the non-negative rank of activation matrices. Relying on the non-negativity of rectified-linear units, the non-negative rank is the smallest number that admits an exact non-negative matrix factorization.
We derive an upper bound on the amount of memorization in terms of the non-negative rank, and show it is a natural complexity measure for rectified-linear units.
With a focus on deep convolutional neural networks trained to perform object recognition, we show that the two non-negative factors derived from deep network layers decompose the information held therein in an interpretable way. The first of these factors provides heatmaps which highlight similarly encoded regions within an input image or image set. We find that these networks learn to detect semantic parts and form a hierarchy, such that parts are further broken down into sub-parts.
We quantitatively evaluate the semantic quality of these heatmaps by using them to perform semantic co-segmentation and co-localization. In spite of the convolutional network we use being trained solely with image-level labels, we achieve results comparable or better than domain-specific state-of-the-art methods for these tasks.
The second non-negative factor provides a bag-of-concepts representation for an image or image set. We use this representation to derive global image descriptors for images in a large collection. With these descriptors in hand, we perform two variations content-based image retrieval, i.e. reverse image search. Using information from one of the non-negative matrix factors we obtain descriptors which are suitable for finding semantically related images, i.e., belonging to the same semantic category as the query image. Combining information from both non-negative factors, however, yields descriptors that are suitable for finding other images of the specific instance depicted in the query image, where we again achieve state-of-the-art performance.Fayez Lahoud, Sabine Süsstrunk
Olivier Canévet, Jean-Marc Odobez