Data wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure quality and useful data. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.
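As a concrete illustration, here is a minimal wrangling sketch in Python with pandas; the file name and column names are hypothetical placeholders, not taken from any particular dataset.

```python
# Minimal data-wrangling sketch using pandas (hypothetical file and columns).
import pandas as pd

# Load raw data; "survey_raw.csv" is a placeholder file name.
raw = pd.read_csv("survey_raw.csv")

# Transform: normalize column names, parse dates, drop exact duplicates.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")
clean = raw.drop_duplicates()

# Map: derive a new field that is more useful for downstream analytics.
clean["tenure_days"] = (pd.Timestamp.today() - clean["signup_date"]).dt.days
```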
Lexicography
Lexicography is the study of lexicons and the art of compiling dictionaries, and is divided into two separate academic disciplines. Practical lexicography is the art or craft of compiling, writing and editing dictionaries. Theoretical lexicography is the scholarly study of semantic, orthographic, syntagmatic and paradigmatic features of lexemes of the lexicon (vocabulary) of a language. It develops theories of dictionary components and structures linking the data in dictionaries, of the information needs of users in specific types of situations, and of how users may best access the data incorporated in printed and electronic dictionaries.
Training, validation, and test data sets
In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. The input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and test sets.
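The division described above is often done in two stages. Here is a minimal sketch using scikit-learn's train_test_split; the 60/20/20 ratio and the placeholder data are assumptions for illustration, not a prescribed standard.

```python
# Sketch: splitting one dataset into training, validation, and test sets.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(1000, 5)        # placeholder features
y = np.random.randint(0, 2, 1000)  # placeholder labels

# First carve out the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.
```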
Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations.
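In the notation this definition implies, the coefficient is the covariance normalized by the product of the standard deviations; for n paired samples it takes the familiar summation form:

```latex
% Covariance normalized by the product of the standard deviations;
% the second expression is the sample version for n paired observations.
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```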
Data preprocessing
Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, amongst other issues. Analyzing data that has not been carefully screened for such problems can produce misleading results.
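As a sketch of such screening, the snippet below flags each of the problems named above on a toy table; the column names and range bounds are hypothetical.

```python
# Sketch: screening raw records for out-of-range values, impossible
# combinations, and missing values before analysis. Fields are hypothetical.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, -2, 51, np.nan],          # -2 is out of range, NaN is missing
    "pregnant": [False, False, True, False],
    "sex": ["F", "M", "M", "F"],          # a pregnant male: impossible combination
})

missing = df["age"].isna()                          # missing values
out_of_range = ~df["age"].between(0, 120)           # out-of-range ages
impossible = df["pregnant"] & (df["sex"] == "M")    # impossible combination

# Keep only rows that pass every screen before any analysis.
clean = df[~(missing | out_of_range | impossible)]
```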
Distributed computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. Distributed computing is a field of computer science that studies distributed systems. The components of a distributed system interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components.
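As a toy analogue (on one machine rather than networked computers), the sketch below has two processes with no shared state coordinate toward a common goal purely by passing messages through a queue.

```python
# Toy illustration of message passing: two processes with no shared memory
# coordinate through a queue to compute a combined result.
from multiprocessing import Process, Queue

def worker(name, data, out):
    # Each component computes a partial result and sends it as a message.
    out.put((name, sum(data)))

if __name__ == "__main__":
    out = Queue()
    parts = {"node-a": [1, 2, 3], "node-b": [4, 5, 6]}
    procs = [Process(target=worker, args=(n, d, out)) for n, d in parts.items()]
    for p in procs:
        p.start()
    # The coordinator achieves the common goal by combining received messages.
    total = sum(out.get()[1] for _ in procs)
    for p in procs:
        p.join()
    print(total)  # 21
```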
Distributed generation
Distributed generation, also distributed energy, on-site generation (OSG), or district/decentralized energy, is electrical generation and storage performed by a variety of small, grid-connected or distribution system-connected devices referred to as distributed energy resources (DER). Conventional power stations, such as coal-fired, gas, and nuclear powered plants, as well as hydroelectric dams and large-scale solar power stations, are centralized and often require electric energy to be transmitted over long distances.
Empirical risk minimization
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).
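In conventional notation (standard in statistical learning theory, not specific to one source), for a hypothesis h, a loss function L, and training samples (x_i, y_i), the true risk, the empirical risk, and the ERM rule can be written as:

```latex
% True risk (unknown, since the data distribution P is unknown):
R(h) = \mathbb{E}_{(x,y) \sim P}\left[ L(h(x), y) \right]

% Empirical risk, measured on n training samples:
\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)

% Empirical risk minimization selects the hypothesis that minimizes it:
\hat{h} = \underset{h \in \mathcal{H}}{\arg\min}\; \hat{R}_n(h)
```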
Efficient energy use
Efficient energy use, sometimes simply called energy efficiency, is the process of reducing the amount of energy required to provide products and services. For example, insulating a building allows it to use less heating and cooling energy to achieve and maintain thermal comfort. Installing light-emitting diode bulbs, fluorescent lighting, or natural skylight windows reduces the amount of energy required to attain the same level of illumination compared to using traditional incandescent light bulbs.