Computer number format
A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for the convenience of the computer's operation; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display.
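As a rough illustration of the mapping between numeric values and bit patterns, the sketch below prints the bits of a small signed integer and of a 32-bit float. The helper names `int8_bits` and `float32_bits` are hypothetical, and the choice of 8-bit two's complement and IEEE 754 binary32 is an assumption made for the example, not something the entry prescribes.

```python
import struct


def int8_bits(value: int) -> str:
    """Return the 8-bit two's-complement bit pattern of a small signed integer."""
    raw = value.to_bytes(1, byteorder="big", signed=True)
    return format(raw[0], "08b")


def float32_bits(value: float) -> str:
    """Return the 32-bit IEEE 754 binary32 bit pattern of a float."""
    raw = struct.pack(">f", value)  # big-endian, 4 bytes
    return format(int.from_bytes(raw, "big"), "032b")


# Internal patterns versus the decimal text a person would read.
print(-5, "->", int8_bits(-5))
print(0.15625, "->", float32_bits(0.15625))
```

Converting back to readable text (the `print` calls here) is the "conversion for external use" step the entry mentions.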
Multiplier-accumulator
In computing, originally in digital signal processing, the combined multiply–accumulate (MAC) or multiply-add (MAD) operation is a machine instruction that computes the product of two numbers and adds the result to the contents of an accumulator. The electronic circuit that performs this operation is called a "multiplier-accumulator"; the operation itself is often abbreviated to MAC or "MAC operation".
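The multiply–accumulate step can be sketched in software as `acc = acc + b * c`, repeated over a sequence. The functions `mac` and `dot` below are hypothetical names used only for illustration. Note that this version rounds after the multiply and again after the add, whereas a hardware fused MAC may round only once.

```python
def mac(accumulator: float, b: float, c: float) -> float:
    """One multiply-accumulate step: add the product b * c to the accumulator."""
    return accumulator + b * c


def dot(xs, ys) -> float:
    """Dot product written as a chain of MAC steps, as in a DSP inner loop."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```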
Decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions. The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values.
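The rounding issue described above can be seen with Python's standard `decimal` module, used here as one convenient decimal floating-point implementation. The specific values 0.1, 0.2 and 0.3 are chosen only for illustration.

```python
from decimal import Decimal

# Binary floats: 0.1 and 0.2 have no exact base-2 representation,
# so their sum is not exactly the float nearest to 0.3.
binary_sum = 0.1 + 0.2

# Decimal floats: the same fractions are represented exactly in base 10.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(binary_sum == 0.3)
print(decimal_sum == Decimal("0.3"))
```

The decimal values are built from strings, because `Decimal(0.1)` would inherit the binary rounding error of the float literal.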
Catastrophic cancellation
In numerical analysis, catastrophic cancellation is the phenomenon that subtracting good approximations to two nearby numbers may yield a very bad approximation to the difference of the original numbers. For example, suppose two studs of nearly equal length are measured with a ruler that is good only to the centimeter. The measured values may be good approximations, in relative error, to the true lengths: the approximations are in error by less than 2% of the true lengths.
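A small numeric sketch of the stud example follows. The lengths and measurements in it are hypothetical values picked for illustration, since the entry's own figures did not survive; the point is only that each measurement has a small relative error while the difference of the measurements does not.

```python
def relative_error(true_value: float, approx: float) -> float:
    """Relative error of approx with respect to true_value."""
    return abs(true_value - approx) / abs(true_value)


# Hypothetical true lengths (cm) and ruler readings rounded to the centimeter.
true_1, true_2 = 253.51, 252.49
meas_1, meas_2 = 254.0, 252.0

# Each reading is accurate to well under 2% ...
err_1 = relative_error(true_1, meas_1)
err_2 = relative_error(true_2, meas_2)

# ... but the difference of the readings is off by almost 100%.
err_diff = relative_error(true_1 - true_2, meas_1 - meas_2)

print(err_1, err_2, err_diff)
```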
Error analysis (mathematics)
In mathematics, error analysis is the study of the kind and quantity of error, or uncertainty, that may be present in the solution to a problem. This issue is particularly prominent in applied areas such as numerical analysis and statistics. In numerical simulation or modeling of real systems, error analysis is concerned with the changes in the output of the model as the parameters to the model vary about a mean. For instance, in a system modeled as a function of two variables, $z = f(x, y)$, error analysis deals with the propagation of the numerical errors in $x$ and $y$ (around mean values $\bar{x}$ and $\bar{y}$) to error in $z$ (around a mean $\bar{z}$).
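One common way to estimate how input errors propagate through a two-variable model is a first-order (linear) approximation using partial derivatives. The sketch below assumes that approach; the function `propagate_error`, the model `f(x, y) = x * y`, and the numeric values are all hypothetical choices made for illustration.

```python
def propagate_error(f, x_mean, y_mean, dx, dy, h=1e-6):
    """First-order estimate of the output error of f for input errors dx, dy.

    Partial derivatives are approximated by central differences with step h.
    """
    df_dx = (f(x_mean + h, y_mean) - f(x_mean - h, y_mean)) / (2 * h)
    df_dy = (f(x_mean, y_mean + h) - f(x_mean, y_mean - h)) / (2 * h)
    return df_dx * dx + df_dy * dy


def model(x, y):
    """Hypothetical model used only for this example."""
    return x * y


x_mean, y_mean = 2.0, 3.0
dx, dy = 0.01, -0.02

estimated = propagate_error(model, x_mean, y_mean, dx, dy)
actual = model(x_mean + dx, y_mean + dy) - model(x_mean, y_mean)
print(estimated, actual)
```

The estimate and the actual change agree closely here because the input errors are small; the linear approximation degrades as they grow.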
Floating-point error mitigation
Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed. Huberto M. Sierra noted in his 1956 patent "Floating Decimal Point Arithmetic Control Means for Calculator": "Thus under some conditions, the major portion of the significant data digits may lie beyond the capacity of the registers."
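As one example of managing (not eliminating) such error, the sketch below compares a plain running sum with Python's `math.fsum`, which tracks the lost low-order bits during summation. The entry does not name this technique; it is an assumption chosen here for illustration, and `naive_sum` is a hypothetical helper.

```python
import math


def naive_sum(values):
    """Left-to-right running sum; rounding error accumulates at every step."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 10

print(naive_sum(values))   # slightly off from 1.0
print(math.fsum(values))   # correctly rounded sum of the stored inputs
```

An explicit loop is used for the naive case because the built-in `sum` applies its own error compensation to floats in recent Python versions.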
Bfloat16 floating-point format
The bfloat16 (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), intended to accelerate machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of precision rather than the 24-bit significand of the binary32 format.
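Because bfloat16 is the upper half of a binary32 value, a conversion can be sketched by keeping the top 16 bits. The helper names below are hypothetical, and the sketch assumes plain truncation (round toward zero); real hardware and libraries often round to nearest instead.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Truncate a value's binary32 encoding to its upper 16 bits."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keeps sign, 8 exponent bits, top 7 fraction bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand 16 bfloat16 bits back to a float by zero-filling the low bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


bits = float_to_bfloat16_bits(3.14159)
print(hex(bits), bfloat16_bits_to_float(bits))
```

The round trip keeps the magnitude but only about two to three significant decimal digits, which reflects the 8-bit precision described above.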
Truncation error
In numerical analysis and scientific computing, truncation error is an error caused by approximating a mathematical process. For example, a function may be given by an infinite series. In practice, only a finite number of its terms can be used, since making use of all of them would take an infinite amount of computational time. If only the first three terms of the series are used, the truncation error is the sum of all the terms that were left out. Example A: Given the following infinite series, find the truncation error for x = 0.
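To make the idea concrete, the sketch below truncates a series after three terms and measures what was left out. The choice of the exponential series $e^x = \sum_k x^k / k!$ and of the point x = 0.5 are assumptions made for illustration, and `exp_partial_sum` is a hypothetical helper.

```python
import math


def exp_partial_sum(x: float, terms: int) -> float:
    """Sum of the first `terms` terms of the series for e**x."""
    return sum(x**k / math.factorial(k) for k in range(terms))


x = 0.5
approx = exp_partial_sum(x, 3)            # 1 + x + x**2 / 2
truncation_error = math.exp(x) - approx   # what the omitted terms add up to

print(approx, truncation_error)
```

Adding more terms shrinks the truncation error, at the cost of more computation.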
Approximation error
Figure: Approximation of the exponential function by an affine function.
In numerical analysis, a branch of mathematics, the approximation error in some data is the difference between an exact value and an approximate value of it. An approximation error can occur when the measurement of the data is not precise (because of the instruments), or when approximate values are used instead of exact ones (for example, 3.14 instead of π).
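The 3.14-versus-π case can be worked out directly. The sketch below computes the absolute error and, as an additional common measure that the entry does not mention, the relative error; both function names are hypothetical.

```python
import math


def absolute_error(exact: float, approx: float) -> float:
    """Magnitude of the difference between the exact and approximate values."""
    return abs(exact - approx)


def relative_error(exact: float, approx: float) -> float:
    """Absolute error scaled by the magnitude of the exact value."""
    return absolute_error(exact, approx) / abs(exact)


abs_err = absolute_error(math.pi, 3.14)
rel_err = relative_error(math.pi, 3.14)
print(abs_err, rel_err)
```

Here `math.pi` stands in for the exact value, although it is itself a floating-point approximation of π.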
Type punning
In computer science, type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language. In C and C++, constructs such as pointer type conversion and union (C++ adds reference type conversion and reinterpret_cast to this list) are provided in order to permit many kinds of type punning, although some kinds are not actually supported by the standard language.
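The union-based form of type punning can be mimicked in Python with `ctypes.Union`, which overlays its fields on the same memory much as a C union does. This is only a sketch of the idea in another language; the class `FloatBits` and the function `float_bits` are hypothetical names.

```python
import ctypes


class FloatBits(ctypes.Union):
    """Two views of the same 4 bytes: a C float and an unsigned 32-bit integer."""
    _fields_ = [
        ("as_float", ctypes.c_float),
        ("as_uint", ctypes.c_uint32),
    ]


def float_bits(x: float) -> int:
    """Write x as a float, then read the same bytes back as an integer."""
    pun = FloatBits()
    pun.as_float = x
    return pun.as_uint


print(hex(float_bits(1.0)))
```

No numeric conversion takes place here: the integer returned is the raw bit pattern of the stored float.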