**Êtes-vous un étudiant de l'EPFL à la recherche d'un projet de semestre?**

Travaillez avec nous sur des projets en science des données et en visualisation, et déployez votre projet sous forme d'application sur GraphSearch.

Publication# Population Sensing Using Mobile Devices

Résumé

In our daily lives, our mobile phones sense our movements and interactions via a rich set of embedded sensors such as a GPS, Bluetooth, accelerometers, and microphones. This enables us to use mobile phones as agents for collecting spatio-temporal data. The idea of mining these spatio-temporal data is currently being explored for many applications, including environmental pollution monitoring, health care, and social networking. When used as sensing devices, a particular feature of mobile phones is their aspect of mobility, in contrast to static sensors. Furthermore, despite having useful applications, collecting data from mobile phones introduces privacy concerns, as the collected data might reveal sensitive information about the users, especially if the collector has access to auxiliary information. In the first part of this thesis, we use spatio-temporal data collected by mobile phones in order to evaluate different features of a population related to their mobility patterns. In particular, we consider the problems of population-size and population-density estimation that have applications, among others, in crowd monitoring, activity-hotspot detection, and urban analysis. We first conduct an experiment where ten attendees of an open-air music festival act as Bluetooth probes. Next, we construct parametric statistical models to estimate the total number of visible Bluetooth devices and their density in the festival area. We further test our proposed models against Wi-Fi traces obtained on the EPFL campus. We conclude that mobile phones can be effectively used as sensing devices to evaluate mobility-related parameters of a population. For the specific problem of population-density estimation, we investigate the mobility aspect of sensing: We quantitatively analyze the performance of mobile sensors compared to static sensors. Under an independent and identically distributed mobility model for the population, we derive the optimal random-movement strategy for mobile sensors in order to yield the best estimate of population density (in the mean-squared error sense). This enables us to plan an adaptive trajectory for the mobile sensors. In particular, we demonstrate that mobility brings an added value to the sensors; these sensors outperform static sensors for long observation intervals. In the second part of this thesis, we analyze the vulnerability of anonymized mobility statistics stored in the form of histograms. We consider an attacker who has access to an anonymized set of histograms of a set of users’ mobility traces and to an independent set of non-anonymized histograms of traces belonging to the same users. We study the hypothesis-testing problem of identifying the correct matching between the anonymized histograms and the non-anonymized histograms. We show that the solution can be obtained by using a minimum-weight matching algorithm on a complete weighted bipartite graph. By applying the algorithm to Wi-Fi traces obtained on the EPFL campus, we demonstrate that in fact anonymized histograms contain a significant amount of information that could be used to uniquely identify users by an attacker with access to auxiliary information about the users. Finally, we demonstrate how trust relationships between users can be exploited to enhance their privacy. We consider the specific problem of the privacy-preserving computation of functions of data that belong to users in a social network. An example of an application is a poll or a survey on a private issue. Most of the known non-cryptographic solutions to this problem can be viewed as belonging to one of the following two extreme regimes. The first regime is when every user trusts only herself and she is responsible for protecting her own privacy. In other words, the circle of trust of a user has a single member: herself. In the second regime, every user trusts herself and the server, but not any of the other users. In other words, the circle of trust of a user comprises herself and the server. We investigate this problem under the assumption that users are willing to share their private data with trusted friends in the network, hence we consider a regime in which the circle of trust of a user consists of herself and her friends. Thus, our approach falls in-between the two mentioned regimes. Our algorithm consists of first partitioning users into circles of trust and then computing the global function by using results of local computations within each circle. We demonstrate that such trust relationships can be exploited to significantly improve the tradeoff between the privacy of users' data and the accuracy of the computation.

Official source

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Concepts associés

Chargement

Publications associées

Chargement

Concepts associés (27)

Bluetooth

vignette|Logo de Bluetooth.
Bluetooth est une norme de télécommunications permettant l'échange bidirectionnel de données à courte distance en utilisant des ondes radio UHF sur la bande de fréquence d

Donnée

Une donnée est ce qui est connu et qui sert de point de départ à un raisonnement ayant pour objet la détermination d'une solution à un problème en relation avec cette donnée. Cela peut être une descr

Computation

A computation is any type of arithmetic or non-arithmetic calculation that is well-defined. Common examples of computations are mathematical equations and computer algorithms.
Mechanical or electron

Publications associées (104)

Chargement

Chargement

Chargement

Recent advances in data processing and communication systems have led to a continuous increase in the amount of data communicated over today’s networks. These large volumes of data pose new challenges on the current networking infrastructure that only offers a best effort mechanism for data delivery. The emergence of new distributed network architectures, such as peer-to-peer networks and wireless mesh networks, and the need for efficient data delivery mechanisms have motivated researchers to reconsider the way that information is communicated and processed in the networks. This has given rise to a new research field called network coding. The network coding paradigm departs from the traditional routing principle where information is simply relayed by the network nodes towards the destination, and introduces some intelligence in the network through coding at the intermediate nodes. This in-network data processing has been proved to substantially improve the performance of data delivery systems in terms of throughput and error resilience in networks with high path diversity. Motivated by the promising results in the network coding research, we focus in this thesis on the design of network coding algorithms for simultaneous transmission of multiple data sources in overlay networks. We investigate several problems that arise in the context of inter-session network coding, namely (i) decoding delay minimization in inter-session network coding, (ii) distributed rate allocation for inter-session network coding and (iii) correlation-aware decoding of incomplete network coded data. We start by proposing a novel framework for data delivery from multiple sources to multiple clients in an overlay wireline network, where intermediate nodes employ randomized inter-session network coding. We consider networks with high resource diversity, which creates network coding opportunities with possibly large gains in terms of throughput, delay and error robustness. However, the coding operations in the intermediate nodes must be carefully designed in order to enable efficient data delivery. We look at the problem from the decoding delay perspective and design solutions that lead to a small decoding delay at clients through proper coding and rate allocation. We cast the optimization problem as a rate allocation problem, which seeks for the coding operations that minimize the average decoding delay in the client population. We demonstrate the validity of our algorithm through simulations in representative network topologies. The results show that an effective combination of intra- and inter-session network coding based on randomized linear coding permits to reach small decoding delays and to better exploit the available network resources even in challenging network settings. Next, we design a distributed rate allocation algorithm where the users decide locally how many intra- and inter-session network coded packets should be requested from the parent nodes in order to get minimal decoding delay. The capability to take coding decisions locally with only a partial knowledge of the network statistics is of crucial importance for applications where users are organized in dynamic overlay networks. We propose a receiver-driven communication protocol that operates in two rounds. First, the users request and obtain information regarding the network conditions and packet availability in their local neighborhood. Then, every user independently optimizes the rate allocation among different possible intra- and inter-session packet combinations to be requested from its parents. We also introduce the novel concept of equivalent flows, which permits to efficiently estimate the expected number of packets that are necessary for decoding and hence to simplify the rate allocation process. Experimental results indicate that our algorithm is capable of eliminating the bottlenecks and reducing the decoding delay of users with limited resources. We further investigate the application of the proposed distributed rate allocation algorithm to the transmission of video sequences and validate the performance of our system using the NS-3 simulator. The simulation results show that the proposed rate allocation algorithm is successful in improving the quality of the delivered video compared to intra-session network coding based solutions. Finally, we investigate the problem of decoding the source information from an incomplete set of network coded data with the help of source priors in a finite algebraic field. The inability to form a complete decoding system can be often caused by transmission losses or timing constraints imposed by the application. In this case, exact reconstruction of the source data by conventional algorithms such as Gaussian elimination is not feasible; however, partial recovery of the source data may still be possible, which can be useful in applications where approximate reconstruction is informative. We use the statistical characteristics of the source data in order to perform approximate decoding. We first analyze the performance of a hypothetical maximum a posteriori decoder, which recovers the source data from an incomplete set of network coded data given the joint statistics of the sources. We derive an upper bound on the probability of erroneous source sequence decoding as a function of the system parameters. We then propose a constructive solution to the approximate decoding problem and design an iterative decoding algorithm based on message passing, which jointly considers the network coding and the correlation constraints. We illustrate the performance of our decoding algorithm through extensive simulations on synthetic and real data sets. The results demonstrate that, even by using a simple correlation model expressed as a correlation noise between pairs of sources, the original source data can be partially decoded in practice from an incomplete set of network coded symbols. Overall, this thesis addresses several important issues related to the design of efficient data delivery methods with inter-session network coding. Our novel framework for decoding delay minimization can impact the development of practical inter-session network coding algorithms that are appropriate for applications with low delay requirements. Our rate allocation algorithms are able to exploit the high resource diversity of modern networking systems and represent an effective alternative in the development of distributed communication systems. Finally, our algorithm for data recovery from incomplete network coded data using correlation priors can contribute significantly to the improvement of the delivered data quality and provide new insights towards the design of joint source and network coding algorithms.

This thesis is devoted to information-theoretic aspects of community detection. The importance of community detection is due to the massive amount of scientific data today that describes relationships between items from a network, e.g., a social network. Items from this network can be inherently partitioned into a known number of communities, but the partition can only be inferred from the data.
To estimate the underlying partition, data scientists can apply any type of advanced statistical techniques; but the data could be very noisy, or the number of data is inadequate. A fundamental question here is about the possibility of weak recovery: does the data contain a sufficient amount of information that enables us to produce a non-trivial estimate of the partition?
For the purpose of mathematical analysis, the above problem can be formulated as Bayesian inference on generative models. These models, including the stochastic block model (SBM) and censored block model (CBM), consider a random graph generated based on a hidden partition that divides the nodes in the graph into labelled groups. In the SBM, nodes are connected with a probability depending on the labels of the endpoints. Whereas, in the CBM, hidden variables are measured through a noisy channel, and the measurement outcomes form a weighted graph. In both models, inference is the task of recovering the hidden partition from the observed graph. The criteria for weak recovery can be studied via an information-theoretic quantity called mutual information. Once the asymptotic mutual information is computed, phase transitions for the weak recovery can be located.
This thesis pertains to rigorous derivations of single-letter variational expressions for the asymptotic mutual information for models in community detection. These variational expressions, known as the replica predictions, come from heuristic methods of statistical physics. We present our development of new rigorous methods for confirming the replica predictions. These methods are based on extending the recently introduced adaptive interpolation method.
We prove the replica prediction for the SBM in the dense-graph regime with two groups of asymmetric size. The existing proofs in the literature are indirect, as they involve mapping the model to an external problem whose mutual information is determined by a combination of methods. Here, on the contrary, we provide a self-contained and direct proof.
Next, we extend this method to sparse models. Before this thesis, adaptive interpolation was known for providing a conceptually simple proof for replica predictions for dense graphs. Whereas, for a sparse graph, the replica prediction involves a more complicated variational expression, and rigorous confirmations are often lacking or obtained by rather complicated methods. Therefore, we focus on a simple version of CBM on sparse graphs, where hidden variables are measured through a binary erasure channel, for which we fully prove the replica prediction by the adaptive interpolation.
The key for extending the adaptive interpolation to a broader class of sparse models is a concentration result for the so-called "multi-overlaps". This concentration forms the basis of the replica "symmetric" prediction. We prove this concentration result for a related sparse model in the context of physics. This provides inspiration for further development of the adaptive interpolation.

We live in a world characterized by massive information transfer and real-time communication. The demand for efficient yet low-complexity algorithms is widespread across different fields, including machine learning, signal processing and communications. Most of the problems that we encounter across these disciplines involves a large number of modules interacting with each other. It is therefore natural to represent these interactions and the flow of information between the modules in terms of a graph. This leads to the study of graph-based information processing framework. This framework can be used to gain insight into the development of algorithms for a diverse set of applications. We investigate the behaviour of large-scale networks (ranging from wireless sensor networks to social networks) as a function of underlying parameters. In particular, we study the scaling laws and applications of graph-based information processing in sensor networks/arrays, sparsity pattern recovery and interactive content search. In the first part of this thesis, we explore location estimation from incomplete information, a problem that arises often in wireless sensor networks and ultrasound tomography devices. In such applications, the data gathered by the sensors is only useful if we can pinpoint their positions with reasonable accuracy. This problem is particularly challenging when we need to infer the positions based on basic information/interaction such as proximity or incomplete (and often noisy) pairwise distances. As the sensors deployed in a sensor network are often of low quality and unreliable, we need to devise a mechanism to single out those that do not work properly. In the second part, we frame the network tomography problem as a well-studied inverse problem in statistics, called group testing. Group testing involves detecting a small set of defective items in a large population by grouping a subset of items into different pools. The result of each pool is a binary output depending on whether the pool contains a defective item or not. Motivated by the network tomography application, we consider the general framework of group testing with graph constraints. As opposed to conventional group testing where any subset of items can be grouped, here a test is admissible if it induces a connected subgraph. Given this constraint, we are interested in bounding the number of pools required to identify the defective items. Once the positions of sensors are known and the defective sensors are identified, we investigate another important feature of networks, namely, navigability or how fast nodes can deliver a message from one end to another by means of local operations. In the final part, we consider navigating through a database of objects utilizing comparisons. Contrary to traditional databases, users do not submit queries that are subsequently matched to objects. Instead, at each step, the database presents two objects to the user, who then selects among the pair the object closest to the target that she has in mind. This process continues until, based on the user’s answers, the database can identify the target she has in mind. The search through comparisons amounts to determining which pairs should be presented to the user in order to find the target object as quickly as possible. Interestingly, this problem has a natural connection with the navigability property studied in the second part, which enables us to develop efficient algorithms.