
Publication: Human-Centered Scene Understanding via Crowd Counting

Abstract

Human-centered scene understanding is the process of perceiving and analysing a dynamic scene observed through a network of sensors, with an emphasis on human-related activities. It includes the visual perception of human-related activities from either a single image or a video sequence. Scene understanding focused on human-related activities is becoming increasingly popular, driving demand for algorithms that can efficiently model crowd activity in different real-world scenarios. In this thesis, we exploit human-centered scene understanding through crowd counting. Counting people is a challenging task due to perspective distortion and occlusion. We tackle these problems by developing algorithms that leverage a variety of data modalities, including single images, video sequences, and scene perspective maps.

First, we introduce an end-to-end trainable deep architecture for crowd counting that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms previous crowd counting methods, especially when perspective effects are strong.

Second, we explicitly model the scale changes and reason in terms of people per square meter. We show that feeding the perspective model to the network allows us to enforce global scale consistency, and that this model can be obtained on the fly from the drone sensors. In addition, it enables us to enforce physically-inspired temporal consistency constraints that do not have to be learned. This yields an algorithm that outperforms previous methods in inferring crowd density from a moving drone camera, especially when perspective effects are strong.

Third, for video sequences, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still achieving performance similar to the fully supervised case.
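The first contribution, scale-adaptive context encoding, can be illustrated with a toy sketch: features pooled at several receptive-field sizes are blended with per-location softmax weights. This is a deliberately simplified, hypothetical illustration; the actual model learns both the features and the weights end-to-end with a deep network, which is omitted here. It assumes the feature-map dimensions are divisible by every scale.

```python
import numpy as np

def multi_scale_context(features, scales=(1, 2, 4), weights=None):
    """Blend feature maps pooled at several receptive-field sizes using
    per-location softmax weights (toy sketch of scale-adaptive fusion).

    features: 2-D array; its dimensions must be divisible by every scale.
    weights:  optional (len(scales), h, w) logits; uniform if None.
    """
    h, w = features.shape
    pooled = []
    for s in scales:
        ph, pw = h // s, w // s
        # average-pool with an s-by-s window...
        p = features.reshape(ph, s, pw, s).mean(axis=(1, 3))
        # ...then upsample back to full resolution by repetition
        pooled.append(np.kron(p, np.ones((s, s))))
    stack = np.stack(pooled)                  # (n_scales, h, w)
    if weights is None:
        weights = np.zeros_like(stack)        # uniform logits
    soft = np.exp(weights) / np.exp(weights).sum(axis=0, keepdims=True)
    return (soft * stack).sum(axis=0)         # per-location weighted blend
```

With uniform weights the output is simply the average of the pooled maps; a learned network would instead predict the logits from the image so that each location picks the context scale that best matches the local perspective.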




Related concepts (21)

Image segmentation

Image segmentation is an image-processing operation that detects and groups pixels according to criteria such as intensity or spatial position, so that the image appears as a set of uniform regions.

Optical flow

[Figure: the optical flow perceived by a rotating observer (in this case, a fly); the arrows represent the direction and speed of the motion.]
Optical flow is the apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between the observer and the scene.

Inference (logic)

Inference is a movement of thought that passes from one or more assertions (statements or propositions affirmed as true, called premises) to a new assertion that follows from them.

Related publications (31)


Humans have the ability to learn. Having seen an object, we can recognise it later. We can do this because our nervous system uses efficient and robust visual processing and can learn from sensory input. Designing algorithms to learn from visual data, on the other hand, is a difficult task. More than fifty years ago, Rosenblatt proposed the perceptron algorithm. From data examples, the perceptron learns a linear separation that categorises the data into two classes. The algorithm served as a simple model of neuronal learning. Two further important ideas were added to the perceptron: first, to look for a maximal margin of separation; second, to separate the data in a possibly high-dimensional feature space, related nonlinearly to the initial data space and allowing nonlinear separations. Importantly, learning in the feature space can be performed implicitly, and hence efficiently, with the use of a kernel, a measure of similarity between two data points. The combination of these ideas led to the support vector machine, an efficient algorithm with high performance.

In this thesis, we design an algorithm to learn the categorisation of data into multiple classes and apply it to a real-time vision task, the recognition of human faces. Our algorithm can be seen as a generalisation of the support vector machine to multiple classes, and we show how it can be implemented efficiently. To avoid a large number of small but time-consuming updates of the variables, limited-accuracy computations are used. We prove a bound on the accuracy needed to find a solution; the proof motivates a heuristic that further increases efficiency. We also derive a second implementation using a stochastic gradient descent method, which is appealing because it has a direct interpretation and can be used in an online setting.

Conceptually, our approach differs from standard support vector approaches in that examples can be rejected rather than necessarily being attributed to one of the categories. This is natural in the context of a vision task: at any time, the sensory input can be something unseen before and hence unrecognisable. Our visual data are images acquired with the recently developed adaptive vision sensor from CSEM. The vision sensor has two important features. First, like the human retina, it adapts locally to light intensity, giving the sensor a high dynamic range. Second, the image gradient is computed on the sensor chip and is thus available directly from the sensor in real time. The sensor output is time-encoded: information about strong local contrast is transmitted first and the weakest contrast information at the end.

To recognise faces, possibly moving in front of the camera, the sensor images have to be processed in a robust way. Representing images so that they exhibit local invariances is a common yet unsolved problem in computer vision. We develop the following representation of the sensor output: the image gradient information is decomposed into local histograms over contrast intensity, local in both position and gradient direction. The representation therefore has local invariance to translation, rotation, and scaling, and the histograms can be computed efficiently because the sensor output is already ordered with respect to local contrast. Our support vector approach for multicategorical data uses the local histogram features to learn the recognition of faces. As recognition is time-consuming, a face detection stage is used beforehand; we learn the detection features in an unsupervised manner using a specially designed optimisation procedure. The combined system to detect and recognise the faces of a small group of individuals is efficient, robust, and reliable.
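Rosenblatt's perceptron, the starting point of the thesis's discussion, fits in a few lines. The sketch below is the classical binary update rule, not the thesis's multiclass support vector algorithm; it converges whenever the two classes are linearly separable.

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Rosenblatt's perceptron: learn w, b such that sign(w.x + b)
    matches labels y in {-1, +1} for linearly separable data."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on boundary)
                w += lr * yi * xi        # push the separator towards xi
                b += lr * yi
                errors += 1
        if errors == 0:                  # converged: all points separated
            break
    return w, b

# Usage: learn the AND-like concept, which is linearly separable
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w, b = perceptron(X, y)
```

The later support-vector refinements mentioned above (maximal margin, kernels) address what this plain rule lacks: the perceptron stops at the first separator it finds, with no margin guarantee.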

Pascal Fua, Weizhe Liu, Mathieu Salzmann

Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still achieving performance similar to the fully supervised case.
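The conservation constraint can be illustrated with a toy example: if the flows between discrete grid cells across two consecutive frames were known, the densities at both frames would follow from simple marginal sums, and the total people count would be preserved by construction. This is a deliberate simplification; the paper estimates such flows with a deep network over spatial neighbourhoods, which is omitted here.

```python
import numpy as np

def density_from_flows(flows):
    """Recover per-cell densities from a people-flow matrix.

    flows[i][j] = number of people moving from cell i (frame t-1)
    to cell j (frame t); staying put means i == j.  Because every
    person who leaves a cell arrives in exactly one cell, row sums
    give the densities at t-1 and column sums the densities at t,
    and the total count is conserved automatically.
    """
    flows = np.asarray(flows, dtype=float)
    density_prev = flows.sum(axis=1)   # people leaving each cell
    density_next = flows.sum(axis=0)   # people arriving in each cell
    assert np.isclose(density_prev.sum(), density_next.sum())
    return density_prev, density_next
```

Regressing flows and deriving densities this way lets the training loss penalise any violation of people conservation, which a per-frame density regressor cannot express.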

2021

The trends in the design of image sensors are towards low noise, high sensitivity, high dynamic range, and small pixel size. How can we benefit from pixels with small size and high sensitivity? In this dissertation, we study a new image sensor that is reminiscent of traditional photographic film: each pixel has a binary response, giving only a one-bit quantized measurement of the local light intensity. The response function of the sensor is non-linear and similar to a logarithmic function, which makes the sensor suitable for high-dynamic-range imaging.

We first formulate the oversampled binary sensing scheme as a parameter estimation problem based on quantized Poisson statistics. We show that, with a single-photon quantization threshold and large oversampling factors, the Cramér-Rao lower bound (CRLB) of the estimation variance approaches that of an ideal unquantized sensor, that is, as if there were no quantization in the sensor measurements. Furthermore, the CRLB is shown to be asymptotically achievable by the maximum likelihood estimator (MLE). By showing that the log-likelihood function is concave, we guarantee the global optimality of iterative algorithms in finding the MLE.

We then study the performance of the oversampled binary sensing scheme in the presence of dark-current noise, modeled as additive Bernoulli noise with a known parameter that only flips the binary output from "0" to "1". We show that the binary sensor is quite robust to this noise and that its dynamic range is only slightly reduced: the sensor first benefits from an increasing oversampling factor and then suffers in terms of dynamic range. We again use the MLE to estimate the light intensity. When the threshold is a single photon, the log-likelihood function is still concave, so global optimality can be achieved; for thresholds larger than "1", this property no longer holds.

By proving that, when the light intensity is piecewise-constant, the likelihood function is strictly pseudoconcave, we guarantee that iterative algorithms find the optimal solution of the MLE for arbitrary thresholds. For the general linear light field model, the log-likelihood function is not even quasiconcave when thresholds are larger than "1". In this circumstance, we find an initial solution by approximating the light intensity field with a piecewise-constant model, and then use Newton's method to refine the estimate under the exact model.

We then examine one of the most important parameters of the binary sensor: the threshold used to generate the binary values. We prove the intuitive result that large thresholds achieve better estimation performance for strong light intensities, while small thresholds work better for low light intensities. To make a binary sensor that works over a larger range of light intensities, we propose designing a threshold array containing multiple thresholds instead of a single threshold. The design criterion is to minimize the average CRLB, which is a good approximation of the mean squared error (MSE). A performance analysis of the new binary sensor verifies the effectiveness of our design. Again, the MLE is used to reconstruct the light intensity field from the binary measurements; by showing that the log-likelihood function is concave for arbitrary threshold arrays, we ensure that iterative algorithms can find the optimal solution.

Finally, we study the reconstruction problem for the binary image sensor under a generalized piecewise-constant light intensity field model, which is quite useful when parameters such as the oversampling factor are unknown. We directly estimate the light exposure values, i.e., the number of photons hitting each pixel, assume that they are piecewise-constant, and use the MLE for the reconstruction.

This optimization problem is solved by iterating between two subproblems: finding the optimal light exposure value for each segment given the optimal segmentation of the binary measurements, and finding the optimal segmentation of the binary measurements given the optimal light exposure values. Several algorithms are provided for solving this optimization problem. Dynamic programming can obtain the optimal solution for 1-D signals, but the computation is quite heavy; to reduce this burden, we propose a greedy algorithm and a method based on pruning binary trees or quadtrees.
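For the single-photon-threshold case described above, the MLE even admits a closed form. A small sketch, assuming K independent binary sub-pixel exposures per pixel as in the oversampling model: each sub-pixel records a Poisson(lambda/K) photon count and outputs 1 iff at least one photon arrived, so P(1) = 1 - exp(-lambda/K), and maximising the (concave) log-likelihood for S ones out of K yields lambda_hat = -K ln(1 - S/K).

```python
import math

def binary_sensor_mle(ones, K):
    """Closed-form MLE of the light intensity lambda for an oversampled
    binary sensor with a single-photon threshold.

    ones: number S of sub-pixels that output "1"
    K:    oversampling factor (binary measurements per pixel)

    Setting the derivative of the log-likelihood
        L(lam) = S*log(1 - exp(-lam/K)) - (K - S)*lam/K
    to zero gives lam_hat = -K * log(1 - S/K); concavity of L makes
    this stationary point the global maximum.
    """
    if ones >= K:          # all ones: likelihood increases without bound
        return math.inf    # (sensor saturated, intensity unidentifiable)
    return -K * math.log(1.0 - ones / K)
```

For example, half the sub-pixels firing gives lambda_hat = K ln 2, and the estimate grows without bound as the fraction of ones approaches 1, reflecting the logarithmic, film-like response mentioned in the abstract.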