Half-precision floating-point format

Several earlier 16-bit floating point formats have existed including that of Hitachi's HD61810 DSP of 1982 (a 4-bit exponent and a 12-bit mantissa), the top 16 bits of a 32-bit float (8 exponent and 7 mantissa bits) called a bfloat16, and Thomas J. Scott's WIF of 1991 (5 exponent bits, 10 mantissa bits) and the 3dfx Voodoo Graphics processor of 1995 (same as Hitachi). ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of single or double precision floating point. The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) used the s10e5 data type in 1997 as part of the "bali" design effort. This is described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615. It was popularized by its use in the open-source OpenEXR image format. Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002. However, hardware support for accelerated 16-bit floating point was later dropped by Nvidia before being reintroduced in the Tegra X1 mobile GPU in 2015. The F16C extension in 2012 allows x86 processors to convert half-precision floats to and from single-precision floats with a machine instruction. == IEEE 754 half-precision binary floating-point format: binary16 == The IEEE 754 standard specifies a binary16 as having the following format: • Sign bit: 1 bit • Exponent width: 5 bits • Significand precision: 11 bits (10 explicitly stored) The format is laid out as follows: The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus, only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log10(211) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place). Exponent encoding The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15; also known as exponent bias in the IEEE 754 standard. • Emin = 000012 − 011112 = −14 • Emax = 111102 − 011112 = 15 • Exponent bias = 011112 = 15 Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent. The stored exponents 000002 and 111112 are interpreted specially. The minimum strictly positive (subnormal) value is 2−24 ≈ 5.96 × 10−8. The minimum positive normal value is 2−14 ≈ 6.10 × 10−5. The maximum representable value is (2−2−10) × 215 = 65504. Half precision examples These examples are given in bit representation of the floating-point value. This includes the sign bit, (biased) exponent, and significand. By default, 1/3 rounds down like for double precision, because of the odd number of bits in the significand. The bits beyond the rounding point are ... which is less than 1/2 of a unit in the last place. == ARM alternative half-precision ==