Publication: On Optimal Two Sample Homogeneity Tests for Finite Alphabets

Abstract

Suppose we are given two independent strings of data from a known finite alphabet. We are interested in testing the null hypothesis that both strings were drawn from the same distribution, assuming that the samples within each string are mutually independent. Among statisticians, the most popular solution for such a homogeneity test is the two sample chi-square test, primarily due to its ease of implementation and the fact that the limiting null hypothesis distribution of the associated test statistic is known and easy to compute. Although tests that are asymptotically optimal in error probability have been proposed in the information theory literature, such optimality results are not well known and such tests are rarely used in practice. In this paper we seek to bridge the gap between theory and practice. We study two different optimal tests proposed by Shayevitz [1] and Gutman [2]. We first obtain a simplified structure of Shayevitz’s test and then obtain limiting distributions of the test statistics used in both tests. These results provide guidelines for choosing thresholds that guarantee an approximate false alarm constraint for finite length observation sequences, thus making these tests easy to use in practice. The approximation accuracies are demonstrated using simulations. We argue that such homogeneity tests with provable optimality properties could potentially be better choices than the chi-square test in practice.
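As a point of reference for the baseline mentioned in the abstract, the following is a minimal sketch of the classical two sample chi-square statistic for two strings over a finite alphabet. It is not the test of Shayevitz or Gutman, whose statistics are not given on this page. The function name `two_sample_chi_square`, the example strings, and the choice to skip symbols that appear in neither string are assumptions made for illustration.

```python
from collections import Counter


def two_sample_chi_square(x, y, alphabet):
    """Return (statistic, degrees_of_freedom) for the two sample chi-square test.

    x, y     : sequences of symbols (the two observed strings)
    alphabet : iterable of all symbols that may occur
    """
    n, m = len(x), len(y)
    counts_x, counts_y = Counter(x), Counter(y)
    statistic = 0.0
    observed_symbols = 0
    for symbol in alphabet:
        total = counts_x[symbol] + counts_y[symbol]
        if total == 0:
            # Assumption: a symbol seen in neither string is dropped,
            # which also reduces the degrees of freedom.
            continue
        observed_symbols += 1
        pooled = total / (n + m)          # pooled frequency estimate under the null
        expected_x = n * pooled           # expected count in the first string
        expected_y = m * pooled           # expected count in the second string
        statistic += (counts_x[symbol] - expected_x) ** 2 / expected_x
        statistic += (counts_y[symbol] - expected_y) ** 2 / expected_y
    return statistic, max(observed_symbols - 1, 0)


# Hypothetical example strings over the alphabet {a, b, c}.
x = "aabbbccaab"
y = "abcccccabc"
stat, dof = two_sample_chi_square(x, y, "abc")
print(f"statistic = {stat:.4f}, degrees of freedom = {dof}")
```

Under the null hypothesis the statistic is approximately chi-square distributed with the returned degrees of freedom, so a threshold for a target false alarm probability can be read off the corresponding chi-square quantile.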

Chi-squared test

A chi-squared test (also chi-square or χ² test) is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables (two dimensions of the contingency table) are independent in influencing the test statistic (values within the table). The test is valid when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof.

Pearson's chi-squared test

Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is the most widely used of many chi-squared tests (e.g., Yates, likelihood ratio, portmanteau test in time series, etc.) – statistical procedures whose results are evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in 1900.
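A minimal sketch of Pearson's statistic in its goodness-of-fit form, the sum over categories of (observed − expected)² / expected, may make the description above concrete. The function name `pearson_statistic` and the die-roll counts are hypothetical and chosen only for illustration.

```python
def pearson_statistic(observed_counts, hypothesized_probs):
    """Pearson's chi-squared statistic for observed counts against a
    hypothesized categorical distribution."""
    n = sum(observed_counts)
    statistic = 0.0
    for observed, prob in zip(observed_counts, hypothesized_probs):
        expected = n * prob               # expected count under the hypothesis
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical data: 60 rolls of a die, tested against a fair die.
rolls = [5, 8, 9, 8, 10, 20]
fair = [1 / 6] * 6
chi2 = pearson_statistic(rolls, fair)
print(f"Pearson statistic = {chi2:.2f}")
```

With six categories and a fully specified hypothesized distribution, the statistic would be compared against a chi-squared distribution with five degrees of freedom.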

Test statistic

A test statistic is a statistic (a quantity derived from the sample) used in statistical hypothesis testing. A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the null from the alternative hypothesis, where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.

Advanced statistical physics

We explore statistical physics in both classical and open quantum systems. Additionally, we will cover probabilistic data analysis that is extremely useful in many applications.

Plasma Physics: Introduction

Learn the basics of plasma, one of the fundamental states of matter, and the different types of models used to describe it, including fluid and kinetic.

Victor Panaretos, Yoav Zemel, Valentina Masarotto

We consider the problem of comparing several samples of stochastic processes with respect to their second-order structure, and describing the main modes of variation in this second order structure, if present. These tasks can be seen as an Analysis of Vari ...

Higher-order asymptotics provide accurate approximations for use in parametric statistical modelling. In this thesis, we investigate using higher-order approximations in two specific settings, with a particular emphasis on the tangent exponential model. Th ...

Daniel Patrick Collins, Subhadeep Banik, Willi Meier

A near collision attack against the Grain v1 stream cipher was proposed by Zhang et al. in Eurocrypt 18. The attack uses the fact that two internal states of the stream cipher with very low Hamming distance between them produce similar keystream sequences ...

2023