Concept

Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision. This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE-754 floating point standard noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed." In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128. The IEEE 754 standard specifies a binary128 as having: Sign bit: 1 bit Exponent width: 15 bits Significand precision: 113 bits (112 explicitly stored) This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number. The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log10(2113) ≈ 34.

Source officielle

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

À propos de ce résultat

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Quadruple-precision floating-point format

Graph Chatbot

Chattez avec Graph Search

Towards General-Purpose Decentralized Computing with Permissionless Extensibility

Functional-Basis Analysis of Non-Stationary Signals in Modern Power Grids: Theory and Implementation in Embedded Systems

A 16-bit Floating-Point Near-SRAM Architecture for Low-power Sparse Matrix-Vector Multiplication

Towards General-Purpose Decentralized Computing with Permissionless Extensibility

Functional-Basis Analysis of Non-Stationary Signals in Modern Power Grids: Theory and Implementation in Embedded Systems

A 16-bit Floating-Point Near-SRAM Architecture for Low-power Sparse Matrix-Vector Multiplication