
# Exploratory data analysis

Summary

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond formal modeling, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.
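The spirit described above — summarizing a sample's main characteristics and looking at its shape before any formal modeling — can be sketched with the standard library alone. The dataset here is invented for illustration:

```python
# A minimal exploratory-data-analysis sketch: summary statistics plus a
# text histogram, computed before any model is fitted. Data are made up.
import statistics

data = [2.1, 2.5, 2.8, 3.0, 3.2, 3.3, 3.7, 4.1, 4.8, 9.9]  # note the outlier

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": statistics.stdev(data),
    "min": min(data),
    "max": max(data),
}
print(summary)

# A quick text histogram makes the skew visible without any model.
lo, hi, bins = min(data), max(data), 4
width = (hi - lo) / bins
for b in range(bins):
    left = lo + b * width
    count = sum(1 for x in data
                if left <= x < left + width or (b == bins - 1 and x == hi))
    print(f"[{left:4.1f}, {left + width:4.1f}) {'#' * count}")
```

The gap between the mean (pulled up by the outlier) and the median is exactly the kind of feature EDA is meant to surface before hypotheses are formulated.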
Overview

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

Official source

This page is generated automatically and may contain information that is not correct, complete, up to date, or relevant to your search. The same applies to every other page on this site. Be sure to verify the information against official EPFL sources.

Related people (6)

Related courses (50)

ENV-444: Exploratory data analysis in environmental health

This course teaches how to apply exploratory spatial data analysis to health data. Teaching focuses on the basics of spatial statistics and of epidemiology, and provides a context for analysing geodatasets, making it possible to study the relationship between health and the environment.

CS-401: Applied data analysis

This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the data science world (pandas, scikit-learn, Spark, etc.).

CS-423: Distributed information systems

This course introduces the key concepts and algorithms from the areas of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.
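A classic building block of the information-retrieval foundations such a course covers is TF-IDF ranking. A toy sketch, with an invented corpus and query:

```python
# A minimal TF-IDF ranker: score each document for a query by summing
# term-frequency times inverse-document-frequency. Corpus is made up.
import math
from collections import Counter

docs = [
    "exploratory data analysis summarizes data",
    "distributed information systems index web data",
    "information retrieval ranks documents",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    """log(N / df): rare terms weigh more; absent terms score zero."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df) if df else 0.0

def score(query, doc_tokens):
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query.split())

scores = [score("information retrieval", d) for d in tokenized]
best = scores.index(max(scores))
print(docs[best])  # → "information retrieval ranks documents"
```

Real systems add smoothing, length normalization, and inverted indexes, but the ranking principle is the same.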

Related publications (76)

Related units (3)

Related concepts (28)

Data visualization

[Image caption: Figurative map of the successive losses in men of the French army in the Russian campaign of 1812–1813, by Charles Minard, 1869.]
Data visualization (or dataviz) is …

Data analysis

Data analysis (also called exploratory data analysis, or EDA) is a family of statistical methods whose main characteristics are that they are multidimensional and descriptive …

Data mining

Data mining, also known as data dredging, data prospecting, or knowledge extraction from data, aims to …

Related lectures (98)

This thesis is a contribution to financial statistics. One of the principal concerns of investors is the evaluation of portfolio risk. The notion of risk is vague, but in finance it is always linked to possible losses. In this thesis, we present some measures allowing the evaluation of risk with the help of Bayesian methods. An exploratory analysis of the data is presented to describe the sampling properties of financial time series. This analysis allows us to understand the origins of the daily returns studied in this thesis. Moreover, a discussion of different models is presented. These models make strong assumptions on investor behaviour, which are not always satisfied. This exploratory analysis shows some differences between the behaviour anticipated under equilibrium models and that of real data. The Bayesian approach has been chosen because it allows one to incorporate all the variability, in particular that associated with model choice. The models studied in this thesis allow one to take heteroskedasticity into account, as well as particular shapes of the tails of returns. ARCH-type models and models based on extreme value theory are studied. One original aspect of this thesis is its use of Bayesian analysis to detect change points in financial time series. We suppose that a market has two phases, and that it switches from one state to the other at random. Another new contribution is a model integrating heteroskedasticity and time dependence of extreme values, by superposition of the model proposed by Bortot and Coles (2003) and a GARCH process. This thesis uses simulation intensively for the estimation of risk measures. The drawback of simulation is the amount of time needed to obtain accurate estimates. However, simulation allows one to produce results when direct calculation is not feasible. For example, simulation allows one to compute risk estimates for time horizons greater than one day.
The methods presented in this thesis are illustrated on simulated data and on real data from European and American markets. This thesis involved the construction of a library containing C and S code to perform risk analysis using GARCH and extreme value theory models. The results show that model uncertainty can be incorporated, and that risk measures for time horizons greater than one day can be obtained by simulation. The methods presented in this thesis have a natural representation involving conditioning. Thus, they permit the computation of both conditional and unconditional risk estimates. Three methods are described: the GARCH method, the two-state GARCH method, and the HBC method. Unconditional risk estimation using the GARCH method is satisfactory on data which seem stationary, but not reliable on data which are non-stationary, such as data with change points. The two-state GARCH model does a little better, and gives very satisfactory results when the risk is estimated conditionally on time. The HBC method does not give satisfactory results.
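The simulation-based risk estimation the abstract describes can be sketched in a few lines: simulate a GARCH(1,1) return process and read a one-day 99% Value-at-Risk off the simulated quantile. This is a toy version of the general idea, not the thesis's method; the parameter values are illustrative, not fitted.

```python
# Sketch: simulate a GARCH(1,1) process and estimate one-day 99% VaR
# from the empirical 1% quantile of simulated returns.
import random

random.seed(42)
omega, alpha, beta = 1e-5, 0.08, 0.90  # hypothetical GARCH(1,1) parameters

def simulate_garch(n):
    returns = []
    h = omega / (1 - alpha - beta)  # start at the unconditional variance
    for _ in range(n):
        r = random.gauss(0.0, h ** 0.5)
        returns.append(r)
        h = omega + alpha * r * r + beta * h  # GARCH(1,1) variance recursion
    return returns

sims = sorted(simulate_garch(20000))
var_99 = -sims[int(0.01 * len(sims))]  # VaR is the loss at the 1% quantile
print(f"estimated one-day 99% VaR: {var_99:.4f}")
```

Longer horizons follow the same pattern: simulate multi-day paths, sum the returns along each path, and take the quantile of the summed losses.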

Scikit-HEP is a community-driven and community-oriented project with the goal of providing an ecosystem for particle physics data analysis in Python. Scikit-HEP is a toolset of approximately twenty packages and a few "affiliated" packages. It expands the typical Python data analysis tools for particle physicists. Each package focuses on a particular topic, and interacts with other packages in the toolset, where appropriate. Most of the packages are easy to install in many environments; much work has been done this year to provide binary "wheels" on PyPI and conda-forge packages. The Scikit-HEP project has been gaining interest and momentum by building a user and developer community, engaging collaboration across experiments. Some of the packages are being used by other communities, including the astroparticle physics community. An overview of the overall project and toolset will be presented, as well as a vision for development and sustainability.

Styliani Asimina Giannakopoulou

Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for data analysis. Existing solutions that attempt to automate the data cleaning procedure treat data cleaning as a separate offline process that takes place before analysis begins, while also focusing on a specific use case. In addition, when the analysis involves complex non-relational data such as graphs, data cleaning becomes more challenging as it involves expensive operations. Therefore, offline, specialized cleaning tools exhibit long running times or fail to process large datasets. At the same time, applying data cleaning before analysis starts assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning data that is unnecessary for the analysis. Therefore, from a user's perspective, one is forced to use a different, potentially inefficient tool for each category of errors.

In this thesis we aim for coverage and efficiency of data cleaning. We design and build data cleaning systems that employ high-level abstractions to (a) represent and optimize different cleaning operations for data of various formats, and (b) allow for real-time data cleaning that is relevant to data analysis. We introduce CleanM, a language that can express multiple types of cleaning operations. CleanM goes through a three-level translation process for optimization purposes; a different family of optimizations is applied at each abstraction level. Thus, CleanM can express complex data cleaning tasks, optimize them in a unified way, and deploy them in a scale-out fashion. To further reduce the data-to-insight time, we propose an approach that performs probabilistic repair of denial constraint violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan.

To cover complex data such as graphs, we optimize the building block of data cleaning operations on graph data: subgraph matching. Subgraph matching constitutes the main bottleneck when cleaning graph data, as it is an NP-complete problem. To optimize subgraph matching, we present a scale-up, radix-based algorithm that starts from an arbitrary partitioning of the graph and coordinates parallel pattern matching to eliminate redundant work among the workers. To address load imbalance, we employ a work-stealing technique specific to the subgraph matching problem. Worker threads steal work from straggler threads by using a heuristic that maximizes the work stolen while preserving the order of evaluating candidate vertices.

Overall, instead of using offline, specialised tools, this thesis designs abstractions that optimize different cleaning primitives over heterogeneous data, while also integrating data cleaning tasks seamlessly into data analysis. Specifically, we provide (a) a declarative high-level interface backed by an optimizable query calculus and (b) the required optimizations at the underlying data cleaning layers, while also taking into consideration the exploratory analysis that users perform.
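To see why subgraph matching is the combinatorial bottleneck the abstract identifies, a naive backtracking matcher is enough. This toy version bears no relation to the thesis's radix-based parallel algorithm; it only illustrates the problem being optimized.

```python
# Naive backtracking subgraph matching: enumerate injective mappings of
# pattern vertices to graph vertices that preserve every pattern edge.
# Graphs are adjacency dicts (vertex -> set of neighbours).
def subgraph_matches(pattern, graph):
    p_nodes = sorted(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            yield dict(mapping)
            return
        v = p_nodes[len(mapping)]
        for g in graph:
            if g in mapping.values():          # keep the mapping injective
                continue
            # every already-mapped pattern neighbour of v must stay adjacent
            if all(mapping[u] in graph[g] for u in pattern[v] if u in mapping):
                mapping[v] = g
                yield from extend(mapping)
                del mapping[v]                 # backtrack

    yield from extend({})

# A triangle pattern matched inside a 4-clique.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
matches = list(subgraph_matches(triangle, k4))
print(len(matches))  # → 24: every injective map of 3 vertices into K4 works
```

The candidate loop over all graph vertices at every level is what makes the worst case exponential, and is exactly what partitioning, coordination, and work stealing attack at scale.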