
Bfloat16 floating-point format

Summary
The bfloat16 (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. The format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of significand precision rather than the 24-bit significand of the binary32 format. Bfloat16 numbers are even less suitable for integer calculations than single-precision 32-bit floating-point numbers, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google. It is used in Intel AI processors such as the Nervana NNP-L1000, in Xeon processors (AVX-512 BF16 extensions) and Intel FPGAs, in Google Cloud TPUs, and in TensorFlow. ARMv8.6-A, AMD ROCm, CUDA, and Apple's M2 (and therefore A15 and later) chips also support the bfloat16 format. On these platforms, bfloat16 may also be used in mixed-precision arithmetic, where bfloat16 numbers are operated on after being expanded to wider data types.

bfloat16 has the following format:

Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 8 bits (7 explicitly stored, plus an implicit leading bit), as opposed to 24 bits in the classical single-precision format

Being a truncated IEEE 754 single-precision 32-bit float, the bfloat16 format allows fast conversion to and from an IEEE 754 single-precision 32-bit float. In conversion to bfloat16, the exponent bits are preserved while the significand field is reduced by truncation (thus corresponding to rounding toward 0), ignoring the NaN special case. Preserving the exponent bits maintains the 32-bit float's range of approximately 10⁻³⁸ to 3 × 10³⁸.
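As a concrete illustration of the truncation described above, here is a minimal C sketch that converts binary32 to bfloat16 by dropping the low 16 significand bits, converts back by zero-filling them, and unpacks the three bfloat16 fields. The helper names are illustrative, not part of any standard API, and, like the text, the sketch ignores the NaN special case.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret the bits of a 32-bit float as an unsigned integer. */
static uint32_t f32_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit pattern as a float. */
static float bits_f32(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* binary32 -> bfloat16: keep the high 16 bits (sign, 8 exponent
   bits, top 7 significand bits). Dropping the low 16 significand
   bits truncates toward zero; NaN handling is omitted. */
static uint16_t f32_to_bf16(float f) {
    return (uint16_t)(f32_bits(f) >> 16);
}

/* bfloat16 -> binary32: zero-fill the low 16 bits. This direction
   is exact, since every bfloat16 value is also a binary32 value. */
static float bf16_to_f32(uint16_t h) {
    return bits_f32((uint32_t)h << 16);
}

int main(void) {
    float x = 3.14159265f;
    uint16_t h = f32_to_bf16(x);

    /* Split out the three bfloat16 fields described above. */
    unsigned sign     = (h >> 15) & 0x1;
    unsigned exponent = (h >> 7) & 0xFF; /* 8 bits, bias 127 */
    unsigned mantissa = h & 0x7F;        /* 7 stored bits */

    printf("bf16 bits: 0x%04X (sign=%u exp=%u mant=0x%02X)\n",
           (unsigned)h, sign, exponent, mantissa);
    printf("round trip: %f -> %f\n", (double)x, (double)bf16_to_f32(h));
    return 0;
}
```

Truncation is the cheapest possible conversion; many implementations instead round to nearest even for better accuracy. In the mixed-precision usage mentioned above, hardware typically widens bfloat16 operands back to binary32 (as in bf16_to_f32) before performing arithmetic.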