Publication: Bridging the gap between model-driven and data-driven methods in the era of Big Data

Abstract

Data-driven and model-driven methodologies can be regarded as competing fields, since they tackle similar problems such as prediction. However, the two fields can learn from each other. Data-driven methodologies have been developed to exploit advanced techniques built on Big Data technologies. Model-driven methodologies, on the other hand, concentrate on developing mathematical models based on theory and expert knowledge to allow for interpretability and control. Through three main contributions, this thesis aims to bridge the gap between these two fields by taking the strengths of each and applying them to the other.

Discrete Choice Models (DCMs) have shown tremendous success in many fields, such as transportation. However, they have not evolved to handle the growing amount of available data. Meanwhile, Machine Learning (ML) researchers have developed optimization algorithms to efficiently estimate complex models on large datasets. Faster estimation of DCMs on larger datasets would similarly improve the efficiency of modelers and open new research directions. Thus, we take inspiration from the large body of deep learning research on efficient parameter estimation with extensive data and large numbers of parameters, and apply it to DCMs. The first chapter of this thesis introduces the HAMABS algorithm, which combines three fundamental principles to enable faster parameter estimation of DCMs (a 20x speedup compared to standard estimation) without compromising the precision of the parameter estimates.

Collecting large amounts of data can be cumbersome and costly, even in the era of Big Data. For example, ML researchers in Computer Vision have been developing generative deep learning models to augment datasets. DCM researchers face similar issues with tabular data, e.g. travel surveys. In addition, if the collection process is not performed correctly, these datasets can contain bias, lack consistency, or be unrepresentative of the actual population. The second chapter of this thesis introduces the DATGAN, a Generative Adversarial Network (GAN) that integrates expert knowledge to control the generation process. This new architecture allows modelers to generate controlled and representative synthetic data, outperforming similar state-of-the-art generative models.

Finally, researchers are increasingly developing fully disaggregate agent-based simulation models, which use detailed synthetic populations to generate aggregate passenger flows. However, detailed disaggregate socioeconomic data is usually expensive to collect and heavily restricted in terms of access and usage. As such, synthetic populations are typically either drawn randomly from aggregate-level control totals, limiting their quality, or tightly controlled, limiting their application and usefulness. To address this, the third chapter extends the DATGAN methodology to generate highly detailed and consistent synthetic populations from small sample data. First, ciDATGAN learns to generate the variables in a low-sample, highly detailed dataset, e.g. a household travel survey. It then completes a high-sample dataset with few variables, e.g. a microdata census, by generating the previously learned variables. The results show that this methodology can correct for bias and may enable the transfer of synthetic populations to new areas and contexts.

Related concepts (21)

Machine learning

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machin

Data

In common usage and statistics, data (US: /ˈdætə/; UK: /ˈdeɪtə/) is a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic uni

Data science

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insight

Related publications (55)

Over the past few years, there have been fundamental breakthroughs in core problems in machine learning, largely driven by advances in deep neural networks. The amount of annotated data has increased drastically, and supervised deep discriminative models have exceeded human-level performance in certain object detection tasks. The growing quantity and complexity of available unlabelled data also opens up exciting possibilities for the development of unsupervised learning methods.
Among the family of unsupervised methods, deep generative models find numerous applications. Moreover, as real-world applications involve high-dimensional data, the ability of generative models to automatically learn semantically meaningful subspaces makes their advancement an essential step toward developing more efficient algorithms.
Generative Adversarial Networks (GANs) are a family of unsupervised generative algorithms that have demonstrated impressive performance for data synthesis and are now used in a wide range of computer vision tasks. Despite this success, they have gained a reputation for being difficult to train, which makes developing with them time-consuming and dependent on human involvement. In the first part of this thesis, we focus on improving the stability and the performance of GANs.
First, we consider an alternative to the standard training process, named SGAN, in which several adversarial "local" pairs of networks are trained independently so that a "global" supervising pair of networks can be trained against them. Experimental results on both toy and real-world problems demonstrate that this approach outperforms standard training: it better mitigates mode collapse, converges more stably, and, surprisingly, also converges faster.
To further reduce the computational footprint while maintaining the stability and performance advantages of SGAN, we focus on training a single pair of adversarial networks using variance-reduced gradients. More precisely, we study the effect of stochastic gradient noise on the training of GANs and show that it can prevent the convergence of standard game optimization methods, while the batch version converges. We address this issue with two stochastic variance-reduced gradient and extragradient optimization algorithms for GANs, named SVRG-GAN and SVRE, respectively. We observe empirically that SVRE performs similarly to a batch method on the MNIST dataset while being computationally cheaper, and that SVRE yields more stable GAN training on standard datasets.
In the second part of the thesis, we present our work on people detection. People detection methods are highly sensitive to occlusions between pedestrians, and using joint visual information from multiple synchronized cameras offers an opportunity to improve detection performance. We address the problem of multi-view people occupancy map estimation using an end-to-end deep learning algorithm called DeepMCD that jointly utilizes the correlated streams of visual information. DeepMCD empirically outperformed the classical approaches by a large margin. Finally, we present a new large-scale, high-resolution dataset named WILDTRACK. We provide an accurate joint calibration, as well as a series of benchmark results using recently published baseline algorithms for multi-view detection with deep neural networks, and for trajectory estimation using a non-Markovian model.
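The abstract above builds SVRE on the extragradient method. As a minimal sketch of why an extragradient step matters in game optimization, the toy example below compares it with plain simultaneous gradient descent-ascent on the bilinear game min_x max_y x*y. This is an illustration only, not the thesis's SVRE algorithm: it uses exact (batch) gradients with no stochastic noise or variance reduction, and the game, step size, and function names are assumptions chosen for the example.

```python
import math


def grad(x, y):
    """Partial derivatives of the toy objective f(x, y) = x * y."""
    return y, x  # df/dx = y, df/dy = x


def gda_step(x, y, lr):
    """Simultaneous gradient descent (on x) and ascent (on y)."""
    gx, gy = grad(x, y)
    return x - lr * gx, y + lr * gy


def extragradient_step(x, y, lr):
    """Extragradient: take a look-ahead step, then update the original
    point using the gradient evaluated at the look-ahead point."""
    gx, gy = grad(x, y)
    x_half, y_half = x - lr * gx, y + lr * gy
    gx_half, gy_half = grad(x_half, y_half)
    return x - lr * gx_half, y + lr * gy_half


def run(step, steps=1000, lr=0.1):
    """Iterate `step` from (1, 1) and return the final distance to the
    equilibrium of the game, which sits at (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)


gda_dist = run(gda_step)
eg_dist = run(extragradient_step)
print(f"gradient descent-ascent: distance to equilibrium = {gda_dist:.4f}")
print(f"extragradient:           distance to equilibrium = {eg_dist:.4f}")
```

On this game, each descent-ascent step multiplies the distance to the equilibrium by sqrt(1 + lr^2), so the iterates spiral outward, while each extragradient step multiplies it by sqrt(1 - lr^2 + lr^4), so they spiral inward for any step size below 1.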

Machine Learning is a modern and actively developing field of computer science, devoted to extracting and estimating dependencies from empirical data. It combines such fields as statistics, optimization theory and artificial intelligence. In practical tasks, the general aim of Machine Learning is to construct algorithms able to generalize and predict in previously unseen situations based on some set of examples. Given some finite information, Machine Learning provides ways to extract knowledge from data and to describe, explain and predict from it. Kernel Methods are one of the most successful branches of Machine Learning. They allow linear algorithms with well-founded properties, such as generalization ability, to be applied to non-linear real-life problems. The Support Vector Machine is a well-known example of a kernel method and has found a wide range of applications in data analysis. In many practical applications, some additional prior knowledge is often available. This can be knowledge about the data domain, invariant transformations, inner geometrical structures in the data, some properties of the underlying process, etc. If used well, this information can significantly improve a data processing algorithm. Thus, it is important to develop methods for incorporating prior knowledge into data-dependent models. The main objective of this thesis is to investigate approaches to learning with kernel methods using prior knowledge. Invariant learning with kernel methods is considered in more detail. In the first part of the thesis, kernels are developed which incorporate prior knowledge on invariant transformations. They apply when the desired transformations produce an object around every example, assuming that all points in the given object share the same class. Different types of objects, including hard geometrical objects and distributions, are considered. These kernels were then applied to image classification with Support Vector Machines.
Next, algorithms which specifically include prior knowledge are considered. An algorithm which linearly classifies distributions by their domain was developed. It is constructed such that kernels can be applied to solve non-linear tasks. Thus, it combines the discriminative power of support vector machines and the well-developed framework of generative models. It can be applied to a number of real-life tasks which include data represented as distributions. In the last part of the thesis, the use of unlabelled data as a source of prior knowledge is considered. The technique of modelling the unlabelled data with a graph is taken as a baseline from semi-supervised manifold learning. For classification problems, we use this approach for building graph models of invariant manifolds. For regression problems, we use unlabelled data to take into account the inner geometry of the input space. To conclude, in this thesis we developed a number of approaches for incorporating prior knowledge into kernel methods. We proposed invariant kernels for existing algorithms, developed new algorithms and adapted a technique taken from semi-supervised learning for invariant learning. In all these cases, links with related state-of-the-art approaches were investigated. Several illustrative experiments were carried out on real data from optical character recognition, face image classification and brain-computer interfaces, and on a number of benchmark and synthetic datasets.
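The claim that kernel methods let linear algorithms handle non-linear problems can be illustrated with a small sketch. The example below is not taken from the thesis: it uses a kernel perceptron (a simpler relative of the Support Vector Machine) on the XOR problem, and the function names, the RBF width `gamma`, and the epoch limit are assumptions made for illustration. The same linear learning rule fails with a linear kernel but succeeds once the kernel is swapped for an RBF kernel.

```python
import math


def linear_kernel(x, z):
    """Dot product plus a constant term, i.e. an affine decision function."""
    return sum(a * b for a, b in zip(x, z)) + 1.0


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision(X, y, alpha, kernel, x_new):
    """Kernel expansion of the perceptron's decision function."""
    return sum(a * yj * kernel(xj, x_new) for a, xj, yj in zip(alpha, X, y))


def train_kernel_perceptron(X, y, kernel, max_epochs=50):
    """Perceptron in dual form: alpha[i] counts mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * decision(X, y, alpha, kernel, xi) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha


def predict_all(X, y, alpha, kernel):
    return [1 if decision(X, y, alpha, kernel, x) > 0 else -1 for x in X]


# XOR: no straight line separates the two classes.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]

linear_pred = predict_all(X, y, train_kernel_perceptron(X, y, linear_kernel), linear_kernel)
rbf_pred = predict_all(X, y, train_kernel_perceptron(X, y, rbf_kernel), rbf_kernel)

print("labels:          ", y)
print("linear kernel:   ", linear_pred)
print("RBF kernel:      ", rbf_pred)
```

Only the kernel function changes between the two runs. Prior knowledge of the kind discussed above would enter through the design of that function, for example a kernel built to be invariant to known transformations.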
