Linear regressionIn statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.
Multivariate statisticsMultivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
Total organic carbonTotal organic carbon (TOC) is an analytical parameter representing the concentration of organic carbon in a sample. TOC determinations are made in a variety of application areas. For example, TOC may be used as a non-specific indicator of water quality, or TOC of source rock may be used as one factor in evaluating a petroleum play. For marine surface sediments average TOC content is 0.5% in the deep ocean, and 2% along the eastern margins.
Colored dissolved organic matterColored dissolved organic matter (CDOM) is the optically measurable component of dissolved organic matter in water. Also known as chromophoric dissolved organic matter, yellow substance, and gelbstoff, CDOM occurs naturally in aquatic environments and is a complex mixture of many hundreds to thousands of individual, unique organic matter molecules, which are primarily leached from decaying detritus and organic matter. CDOM most strongly absorbs short wavelength light ranging from blue to ultraviolet, whereas pure water absorbs longer wavelength red light.
Dirichlet-multinomial distributionIn probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution (after George Pólya). It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector , and an observation drawn from a multinomial distribution with probability vector p and number of trials n.
Single-linkage clusteringIn statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other. This method tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters.
Particulate organic matterParticulate organic matter (POM) is a fraction of total organic matter operationally defined as that which does not pass through a filter pore size that typically ranges in size from 0.053 millimeters (53 μm) to 2 millimeters. Particulate organic carbon (POC) is a closely related term often used interchangeably with POM. POC refers specifically to the mass of carbon in the particulate organic material, while POM refers to the total mass of the particulate organic matter.
Clustering high-dimensional dataClustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
Principal component analysisPrincipal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
Correlation clusteringClustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. Cluster analysis In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects.