
Publication: Tests of mutual independence among several random vectors using univariate and multivariate ranks of nearest neighbours

Abstract

Testing mutual independence among several random vectors of arbitrary dimensions is a challenging problem in statistics, and it has gained considerable interest in recent years. In this article, we propose some nonparametric tests based on different notions of ranks of nearest neighbours. These proposed tests can be conveniently used for high-dimensional data, even when the dimensions of the random vectors are larger than the sample size. We investigate the performance of these tests on several simulated and real data sets and also use them to identify causal relationships among the random vectors. Our numerical results show that they can outperform state-of-the-art tests in a wide variety of examples.
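The abstract does not spell out the test statistics, but the general idea of a nearest-neighbour-based permutation test of independence can be sketched as follows. The coincidence statistic below (fraction of observations whose nearest neighbour in x-space and in y-space is the same observation) is an illustrative stand-in, not the paper's exact rank-based construction:

```python
import numpy as np

def nn_index(points):
    """Index of each row's nearest neighbour under Euclidean distance."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.argmin(axis=1)

def nn_coincidence_test(x, y, n_perm=500, seed=0):
    """Permutation test of independence between the paired samples x, y.

    Statistic: the fraction of observations whose nearest neighbour in
    x-space and in y-space is the same observation.  Under independence
    such coincidences occur only by chance, so a large observed fraction
    is evidence of dependence.  (Illustrative statistic only; the paper
    builds its tests from more refined notions of ranks.)
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    nx, ny = nn_index(x), nn_index(y)
    observed = np.mean(nx == ny)
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(n)
        inv = np.empty(n, dtype=int)
        inv[perm] = np.arange(n)
        # nearest neighbour of y[perm[i]], relabelled to the permuted pairing
        perm_stats[b] = np.mean(nx == inv[ny[perm]])
    p_value = (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
    return observed, p_value
```

Recomputing the statistic under random re-pairings of x and y gives an exact finite-sample null distribution, which is why permutation calibration is a natural fit for such nonparametric statistics.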



Related publications (2)

Related concepts (11)

Dimension

In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one, because only one coordinate is needed to specify a point on it.

Data set

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question.

Sample size determination

Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample.


Popular clustering algorithms based on usual distance functions (e.g., the Euclidean distance) often suffer in high dimension, low sample size (HDLSS) situations, where concentration of pairwise distances and violation of neighbourhood structure have adverse effects on their performance. In this article, we use a new data-driven dissimilarity measure, called MADD, which addresses these problems. MADD uses the distance concentration phenomenon to its advantage, and as a result, clustering algorithms based on MADD usually perform well for high-dimensional data. We establish this through both theoretical and numerical studies. We also address the problem of estimating the number of clusters. This is a challenging problem in cluster analysis, and several algorithms are available for it. We show that many of these existing algorithms have superior performance in high dimensions when they are constructed using MADD. We also construct a new estimator based on a penalized version of the Dunn index and prove its consistency in the HDLSS asymptotic regime. Several simulated and real data sets are analyzed to demonstrate the usefulness of MADD for cluster analysis of high-dimensional data.
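A minimal sketch of the idea behind MADD (mean absolute difference of distances), assuming the commonly used formulation in which two points are compared by how differently they relate to the rest of the sample; normalisation details may differ from the paper's:

```python
import numpy as np

def madd_matrix(data):
    """Pairwise MADD dissimilarities between the rows of `data`.

    MADD compares two points through their distances to all other
    sample points:

        madd(i, j) = (1 / (n - 2)) * sum_{k != i, j} |d(x_i, x_k) - d(x_j, x_k)|

    where d is the Euclidean distance.  Sketch of the usual definition;
    not necessarily the paper's exact normalisation.
    """
    n = len(data)
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False
            out[i, j] = out[j, i] = np.abs(d[i, mask] - d[j, mask]).mean()
    return out
```

In an HDLSS scale-difference example (two groups with different variances in high dimension), pairwise Euclidean distances concentrate so strongly that between-group distances can be smaller than within-group ones, breaking the neighbourhood structure; MADD, being built from differences of those concentrated distances, keeps the groups separated. The matrix returned above can be passed as a precomputed dissimilarity to any clustering routine that accepts one.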

Testing for equality of two high-dimensional distributions is a challenging problem, and it becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, which can be used for data of arbitrary dimensions. Most of these test statistics are computed using pairwise Euclidean distances among the observations. However, due to the concentration of pairwise Euclidean distances, these tests perform poorly in many high-dimensional problems. Some of them can even have power below the nominal level when the scale difference between the two distributions dominates the location difference. To overcome these limitations, we introduce some new dissimilarity indices and use them to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage, and as a result, they outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated as well as real data sets, we demonstrate their usefulness in high dimension, low sample size situations.
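The classical Euclidean-distance tests that the abstract refers to can be illustrated with a k-nearest-neighbour variant (in the spirit of Schilling/Henze-type graph-based tests). This sketch uses plain Euclidean distances, i.e. exactly the kind of baseline that the paper's modified dissimilarity indices are designed to improve on; the paper's tests may use other graphs (e.g. a minimum spanning tree):

```python
import numpy as np

def knn_two_sample_test(x, y, k=3, n_perm=500, seed=0):
    """Permutation two-sample test on a k-nearest-neighbour graph.

    Statistic: the fraction of k-NN edges in the pooled sample that
    join two points from the same sample; values well above the chance
    level suggest the two distributions differ.  A sketch of a generic
    graph-based test, not the paper's modified versions.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x), int), np.ones(len(y), int)]
    d = np.linalg.norm(pooled[:, None, :] - pooled[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest points

    def same_sample_fraction(lab):
        return np.mean(lab[:, None] == lab[nbrs])

    observed = same_sample_fraction(labels)
    perm_stats = np.array([same_sample_fraction(rng.permutation(labels))
                           for _ in range(n_perm)])
    p_value = (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
    return observed, p_value
```

Because the statistic depends on the data only through pairwise dissimilarities, replacing the Euclidean matrix `d` with a concentration-aware dissimilarity (such as the indices proposed in the paper) changes the test without altering the permutation machinery.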