**Êtes-vous un étudiant de l'EPFL à la recherche d'un projet de semestre?**

Travaillez avec nous sur des projets en science des données et en visualisation, et déployez votre projet sous forme d'application sur GraphSearch.

Concept# Simple random sample

Résumé

In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen from a larger set (a population) in which a subset of individuals are chosen randomly, all with the same probability. It is a process of selecting a sample in a random way. In SRS, each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals. A simple random sample is an unbiased sampling technique. Simple random sampling is a basic type of sampling and can be a component of other more complex sampling methods.
Introduction
The principle of simple random sampling is that every set of items has the same probability of being chosen. For example, suppose N college students want to get a ticket for a basketball game, but there are only X < N tickets for them, so they decide to have a fair way to see who gets to go. Then, everybody is given a number in the range from 0 to N-1, and random numbers are generated, either electronic

Source officielle

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Publications associées

Chargement

Personnes associées

Chargement

Unités associées

Chargement

Concepts associés

Chargement

Cours associés

Chargement

Séances de cours associées

Chargement

Personnes associées (3)

Unités associées (3)

Publications associées (30)

Chargement

Chargement

Chargement

Concepts associés (8)

Échantillonnage (statistiques)

thumb|Exemple d'échantillonnage aléatoire
En statistique, l'échantillonnage désigne les méthodes de sélection d'un sous-ensemble d'individus (un échantillon) à l'intérieur d'une population pour esti

Statistique

La statistique est la discipline qui étudie des phénomènes à travers la collecte de données, leur traitement, leur analyse, l'interprétation des résultats et leur présentation afin de rendre ces don

Cluster sampling

In statistics, cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population. It is often used in marketing resea

Cours associés (29)

CS-401: Applied data analysis

This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the data science world (pandas, scikit-learn, Spark, etc.)

COM-622: Topics in information-theoretic cryptography

Information-theoretic methods and their application to secrecy & privacy. Perfect information-theoretic secrecy. Randomness extraction & privacy amplification. Secret key generation from common randomness. Measures of information leakage incl. differential privacy, perfect privacy, & mutual info.

MICRO-110: Design of experiments

This course provides an introduction to experimental statistics, including use of population statistics to characterize experimental results, use of comparison statistics and hypothesis testing to evaluate validity of experiments, and design, application, and analysis of multifactorial experiments

Séances de cours associées (62)

Deep neural networks have been empirically successful in a variety of tasks, however their theoretical understanding is still poor. In particular, modern deep neural networks have many more parameters than training data. Thus, in principle they should overfit the training samples and exhibit poor generalization to the complete data distribution. Counter intuitively however, they manage to achieve both high training accuracy and high testing accuracy. One can prove generalization using a validation set, however this can be difficult when training samples are limited and at the same time we do not obtain any information about why deep neural networks generalize well. Another approach is to estimate the complexity of the deep neural network. The hypothesis is that if a network with high training accuracy has high complexity it will have memorized the data, while if it has low complexity it will have learned generalizable patterns. In the first part of this thesis we explore Spectral Complexity, a measure of complexity that depends on combinations of norms of the weight matrices of the deep neural network. For a dataset that is difficult to classify, with no underlying model and/or no recurring pattern, for example one where the labels have been chosen randomly, spectral complexity has a large value, reflecting that the network needs to memorize the labels, and will not generalize well. Putting back the real labels, the spectral complexity becomes lower reflecting that some structure is present and the network has learned patterns that might generalize to unseen data. Spectral complexity results in vacuous estimates of the generalization error (the difference between the training and testing accuracy), and we show that it can lead to counterintuitive results when comparing the generalization error of different architectures. In the second part of the thesis we explore non-vacuous estimates of the generalization error. In Chapter 2 we analyze the case of PAC-Bayes where a posterior distribution over the weights of a deep neural network is learned using stochastic variational inference, and the generalization error is the KL divergence between this posterior and a prior distribution. We find that a common approximation where the posterior is constrained to be Gaussian with diagonal covariance, known as the mean-field approximation, limits significantly any gains in bound tightness. We find that, if we choose the prior mean to be the random network initialization, the generalization error estimate tightens significantly. In Chapter 3 we explore an existing approach to learning the prior mean, in PAC-Bayes, from the training set. Specifically, we explore differential privacy, which ensures that the training samples contribute only a limited amount of information to the prior, making it distribution and not training set dependent. In this way the prior should generalize well to unseen data (as it hasn't memorized individual samples) and at the same time any posterior distribution that is close to it in terms of the KL divergence will also exhibit good generalization.

Intelligence involves processing sensory experiences into representations useful for prediction. Understanding sensory experiences and building these contextual representations without prior knowledge of sensor models and environment is a challenging unsupervised learning problem. Current machine learning methods process new sensory data using prior knowledge defined by either domain knowledge or datasets. When datasets are not available, data acquisition is needed, though automating exploration in support of learning is still an unsolved problem. Here we develop a method that enables agents to efficiently collect data for learning a predictive sensor model-without requiring domain knowledge, human input, or previously existing data-using ergodicity to specify the data acquisition process. This approach is based entirely on data-driven sensor characteristics rather than predefined knowledge of the sensor model and its physical characteristics. We learn higher quality models with lower energy expenditure during exploration for data acquisition compared to competing approaches, including both random sampling and information maximization. In addition to applications in autonomy, our approach provides a potential model of how animals use their motor control to develop high quality models of their sensors (sight, sound, touch) before having knowledge of their sensor capabilities or their surrounding environment. Information-based search strategies are relevant for the learning of interacting agents dynamics and usually need predefined data. The authors propose a method to collect data for learning a predictive sensor model, without requiring domain knowledge, human input, or previously existing data.

Valentina Masarotto, Victor Panaretos, Yoav Zemel

Covariance operators are fundamental in functional data analysis, providing the canonical means to analyse functional variation via the celebrated Karhunen-Loeve expansion. These operators may themselves be subject to variation, for instance in contexts where multiple functional populations are to be compared. Statistical techniques to analyse such variation are intimately linked with the choice of metric on covariance operators, and the intrinsic infinite-dimensionality of these operators. In this paper, we describe the manifold-like geometry of the space of trace-class infinite-dimensional covariance operators and associated key statistical properties, under the recently proposed infinite-dimensional version of the Procrustes metric (Pigoli et al. Biometrika101, 409-422, 2014). We identify this space with that of centred Gaussian processes equipped with the Wasserstein metric of optimal transportation. The identification allows us to provide a detailed description of those aspects of this manifold-like geometry that are important in terms of statistical inference; to establish key properties of the Frechet mean of a random sample of covariances; and to define generative models that are canonical for such metrics and link with the problem of registration of warped functional data.

2019