
Category: Statistics

Summary

Statistics (from German: Statistik, "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.
When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution's central or typical value, while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences on mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena.
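To make the distinction concrete, here is a minimal Python sketch contrasting a descriptive summary of a sample with a simple inferential step: a normal-approximation 95% confidence interval for the population mean. The height data are made up for illustration.

```python
import math
from statistics import mean, stdev

# Hypothetical sample: measured heights (cm) of 10 people drawn from a population
sample = [172.0, 168.5, 181.2, 175.3, 169.9, 177.8, 171.4, 174.6, 179.1, 166.2]

# Descriptive statistics: summarize the sample itself
m = mean(sample)   # central tendency (location)
s = stdev(sample)  # dispersion (variability)

# Inferential statistics: an approximate 95% confidence interval for the
# population mean, using the normal approximation (z ≈ 1.96)
half_width = 1.96 * s / math.sqrt(len(sample))
ci = (m - half_width, m + half_width)
```

The descriptive quantities `m` and `s` say something only about the sample; the interval `ci` is an inferential statement about the unseen population, valid under sampling assumptions (independent draws, approximate normality).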

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related courses (197)

DH-406: Machine learning for DH

This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and implement…

MATH-413: Statistics for data science

Statistics lies at the foundation of data science, providing a unifying theoretical and methodological backbone for the diverse tasks encountered in this emerging field. This course rigorously develops…

PHYS-467: Machine learning for physicists

Machine learning and data analysis are becoming increasingly central in the sciences, including physics. In this course, fundamental principles and methods of machine learning will be introduced and practised…

Related concepts (698)

Related categories (322)

Related MOOCs (31)

Related people (237)

Related publications (273)

Related startups (6)

Related units (9)

Sample mean and covariance

The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance, are statistics computed from a sample of data on one or more random variables. The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" indicates not the number of people but the entirety of relevant data, whether collected or not. A sample of 40 companies' sales from the Fortune 500 might be used for convenience instead of examining the population, all 500 companies' sales.
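The two estimators can be written in a few lines of Python; the sales and advertising figures below are invented purely to exercise the functions.

```python
def sample_mean(xs):
    """Empirical mean: the sum of the observations divided by their count."""
    return sum(xs) / len(xs)

def sample_covariance(xs, ys):
    """Unbiased sample covariance between two equal-length samples
    (divides by n - 1, the usual bias correction)."""
    n = len(xs)
    mx, my = sample_mean(xs), sample_mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Illustrative (made-up) data: sales vs. advertising spend for 5 companies
sales = [10.0, 12.0, 15.0, 11.0, 17.0]
adspend = [1.0, 1.5, 2.0, 1.2, 2.3]

avg = sample_mean(sales)
cov = sample_covariance(sales, adspend)
```

A positive covariance here would indicate that companies spending more on advertising also tend to report higher sales in this sample; it says nothing by itself about the population without further inferential assumptions.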

Evidence-based medicine

Evidence-based medicine (EBM) is "the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients". The aim of EBM is to integrate the experience of the clinician, the values of the patient, and the best available scientific information to guide decision-making about clinical management. The term was originally used to describe an approach to teaching the practice of medicine and improving decisions by individual physicians about individual patients.

Anscombe's quartet

Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties.

Selected Topics on Discrete Choice

Discrete choice models are used extensively in many disciplines where it is important to predict human behavior at a disaggregate level. This course is a follow-up to the online course “Introduction t…

Neuronal Dynamics - Computational Neuroscience of Single Neurons

The activity of neurons in the brain and the code used by these neurons is described by mathematical neuron models at different levels of detail.

Related lectures (1,000)


Mathematical statistics is the application of probability theory, a branch of mathematics, to statistics, as opposed to techniques for collecting statistical data. Specific mathematical techniques which are used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure theory. Statistical data collection is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.
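For simple linear regression, the least-squares criterion has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch, with toy data chosen to lie exactly on a line:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x (simple linear regression).
    Slope b = cov(x, y) / var(x); intercept a = mean(y) - b * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)              # spread of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Toy data generated from y = 2x + 1, so the fit should recover a=1, b=2
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

More general regression (multiple predictors, regularization) replaces this closed form with a matrix least-squares solve, but the "minimize squared residuals" criterion is the same.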

A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth.
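Arbuthnot's argument is essentially an exact binomial (sign) test: under the null hypothesis of a 50/50 sex ratio, how probable is an outcome at least as extreme as the one observed? A small sketch with a hypothetical count, not Arbuthnot's actual data:

```python
from math import comb

def binomial_sign_test(successes, n, p=0.5):
    """One-sided exact binomial test: probability of observing at least
    `successes` out of `n` trials if the true success probability is p.
    This is the logic behind Arbuthnot-style sign tests."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(successes, n + 1))

# Hypothetical example: 15 male births out of 20 recorded births.
# Is this consistent with a 50/50 sex ratio?
p_value = binomial_sign_test(15, 20)
```

A p-value below a pre-chosen significance level (conventionally 0.05) would lead to rejecting the 50/50 null hypothesis for this sample.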

Active in automation, data management and IoT. Thinkee provides a comprehensive software platform for automating process monitoring, data analysis, and intervention tracking.

Active in glaucoma management, smart contact lens and continuous ocular monitoring. SensiMed has developed the SENSIMED Triggerfish, a smart contact lens revolutionizing glaucoma management by providing a complete 24-hour picture of the eye, offering valuable insights for personalized treatment programs.

Active in bioprocessing, single-use solutions and sensor technology. gyMETRICS SA offers custom single-use solutions for biotech industries, specializing in innovative sensor technology.

Explores air pollution analysis using wind data, probability distributions, and trajectory models for air quality assessment.

Explores the Red bus/Blue bus paradox, nested logit models, and multivariate extreme value models in transportation.

Explores probability distributions for random variables in air pollution and climate change studies, covering descriptive and inferential statistics.

We provide a computationally and statistically efficient method for estimating the parameters of a stochastic covariance model observed on a regular spatial grid in any number of dimensions. Our proposed method, which we call the Debiased Spatial Whittle likelihood, makes important corrections to the well-known Whittle likelihood to account for large sources of bias caused by boundary effects and aliasing. We generalize the approach to flexibly allow for significant volumes of missing data, including those with lower-dimensional substructure, and for irregular sampling boundaries. We build a theoretical framework under relatively weak assumptions which ensures consistency and asymptotic normality in numerous practical settings, including missing data and non-Gaussian processes. We also extend our consistency results to multivariate processes. We provide detailed implementation guidelines which ensure the estimation procedure can be conducted in O(n log n) operations, where n is the number of points of the encapsulating rectangular grid, thus keeping the computational scalability of Fourier- and Whittle-based methods for large data sets. We validate our procedure over a range of simulated and real-world settings, and compare with state-of-the-art alternatives, demonstrating the enduring practical appeal of Fourier-based methods, provided they are corrected by the procedures developed in this paper.

This thesis focuses on non-parametric covariance estimation for random surfaces, i.e. functional data on a two-dimensional domain. Non-parametric covariance estimation lies at the heart of functional data analysis, and considerations of statistical and computational efficiency often compel the use of separability of the covariance when working with random surfaces. We seek to provide efficient alternatives to this ambivalent assumption.

In Chapter 2, we study a setting where the covariance structure may fail to be separable locally, either due to noise contamination or due to the presence of a non-separable short-range dependent signal component. That is, the covariance is an additive perturbation of a separable component by a non-separable but banded component. We introduce non-parametric estimators hinging on shifted partial tracing, a novel concept enjoying strong denoising properties. We illustrate the usefulness of the proposed methodology on a data set of mortality surfaces.

In Chapter 3, we propose a distinctive decomposition of the covariance, which allows us to understand separability as an unconventional form of low-rankness. From this perspective, a separable covariance has rank one. Allowing for a higher rank suggests a structured class in which any covariance can be approximated up to an arbitrary precision. The key notion of the partial inner product allows us to generalize the power iteration method to general Hilbert spaces and estimate the aforementioned decomposition from data. Truncation and retention of the leading terms automatically induces a non-parametric estimator of the covariance, whose parsimony is dictated by the truncation level. Advantages of this approach, allowing for estimation beyond separability, are demonstrated on the task of classification of EEG signals.

While Chapters 2 and 3 propose several generalizations of separability in the densely sampled regime, Chapter 4 deals with the sparse regime, where the latent surfaces are observed only at a few irregular locations. Here, a separable covariance estimator based on local linear smoothers is proposed, which is the first non-parametric utilization of separability in the sparse regime. The assumption of separability reduces the intrinsically four-dimensional smoothing problem into several two-dimensional smoothers and allows the proposed estimator to retain the classical minimax-optimal convergence rate for two-dimensional smoothers. The proposed methodology is used for a qualitative analysis of implied volatility surfaces corresponding to call options, and for prediction of the latent surfaces based on information from the entire data set, allowing for uncertainty quantification. Our quantitative results show that the proposed methodology outperforms the common approach of pre-smoothing every implied volatility surface separately.

Throughout the thesis, we put emphasis on computational aspects, since those are the main reason behind the immense popularity of separability. We show that the covariance structures of Chapters 2 and 3 come with no (asymptotic) computational overhead relative to assuming separability. In fact, the proposed covariance structures can be estimated and manipulated with the same asymptotic costs as the separable model. In particular, we develop numerical algorithms that can be used for efficient inversion, as required e.g. for prediction. All the methods are implemented in R and available on GitHub.

Arthur Ulysse Jacot-Guillarmod

In recent years, Deep Neural Networks (DNNs) have managed to succeed at tasks that previously appeared impossible, such as human-level object recognition, text synthesis, translation, playing games, and many more. In spite of these major achievements, our understanding of these models, in particular of what happens during their training, remains very limited. This PhD started with the introduction of the Neural Tangent Kernel (NTK) to describe the evolution of the function represented by the network during training. In the infinite-width limit, i.e. when the number of neurons in the layers of the network grows to infinity, the NTK converges to a deterministic and time-independent limit, leading to a simple yet complete description of the dynamics of infinitely wide DNNs. This allowed one to give the first general proof of convergence of DNNs to a global minimum, and yielded the first description of the limiting spectrum of the Hessian of the loss surface of DNNs throughout training.

More importantly, the NTK plays a crucial role in describing the generalization abilities of DNNs, i.e. the performance of the trained network on unseen data. The NTK analysis uncovered a direct link between the function learned by infinitely wide DNNs and Kernel Ridge Regression (KRR) predictors, whose generalization properties are studied in this thesis using tools of random matrix theory. Our analysis of KRR reveals the importance of the eigendecomposition of the NTK, which is affected by a number of architectural choices. In very deep networks, an ordered regime and a chaotic regime appear, determined by the choice of non-linearity and the balance between the weight and bias parameters; these two phases are characterized by different speeds of decay of the eigenvalues of the NTK, leading to a tradeoff between convergence speed and generalization. In practical contexts such as Generative Adversarial Networks or Topology Optimization, the network architecture can be chosen to guarantee certain properties of the NTK and its spectrum.

These results give an almost complete description of DNNs in this infinite-width limit. It is then natural to wonder how it extends to the finite-width networks used in practice. In the so-called NTK regime, the discrepancy between finite- and infinite-width DNNs is mainly a result of the variance w.r.t. the sampling of the parameters, as shown empirically and mathematically by relying on the similarity between DNNs and random feature models.

In contrast to the NTK regime, where the NTK remains constant during training, there exist so-called active regimes, where the evolution of the NTK is significant; these appear in a number of settings. One such regime appears in Deep Linear Networks with a very small initialization, where the training dynamics approach a sequence of saddle points, representing linear maps of increasing rank, leading to a low-rank bias which is absent in the NTK regime.