An IEEE 754 format is a "set of representations of numerical values and symbols". A format may also include how the set is encoded.

A floating-point format is specified by
• a base (also called radix) b, which is either 2 (binary) or 10 (decimal) in IEEE 754;
• a precision p;
• an exponent range from emin to emax, with emin = 1 − emax, or equivalently emin = −(emax − 1), for all IEEE 754 formats.

A format comprises
• Finite numbers, which can be described by three integers: s = a sign (zero or one), c = a significand (also called a coefficient or mantissa) having no more than p digits when written in base b (i.e., an integer in the range 0 through $b^p - 1$), and q = an exponent such that emin ≤ q + p − 1 ≤ emax. The numerical value of such a finite number is $(-1)^s \times c \times b^q$. Moreover, there are two zero values, called signed zeros: the sign bit specifies whether a zero is +0 (positive zero) or −0 (negative zero).
• Two infinities: +∞ and −∞.
• Two kinds of NaN (not-a-number): a quiet NaN (qNaN) and a signaling NaN (sNaN).

For example, if
b = 10, p = 7, and emax = 96, then emin = −95, the significand satisfies 0 ≤ c ≤ 9999999, and the exponent satisfies −101 ≤ q ≤ 90. Consequently, the smallest non-zero positive number that can be represented is $1 \times 10^{-101}$, and the largest is $9999999 \times 10^{90}$ ($9.999999 \times 10^{96}$), so the full range of numbers is $-9.999999 \times 10^{96}$ through $9.999999 \times 10^{96}$. The numbers $-b^{1-\mathit{emax}}$ and $b^{1-\mathit{emax}}$ (here, $-1 \times 10^{-95}$ and $1 \times 10^{-95}$) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
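
The figures in the example can be checked mechanically. The following is a minimal sketch, not part of the standard: the helper name `finite_value` and the variable names are chosen here for illustration, and exact rational arithmetic is used so that no rounding enters the check.

```python
from fractions import Fraction


def finite_value(s: int, c: int, q: int, b: int = 10) -> Fraction:
    """Exact value (-1)^s * c * b^q of a finite number given as three integers."""
    return (-1) ** s * c * Fraction(b) ** q


# Parameters of the example in the text: b = 10, p = 7, emax = 96.
b, p, emax = 10, 7, 96
emin = 1 - emax              # relation emin = 1 - emax
c_max = b ** p - 1           # largest significand with at most p digits
q_min = emin - (p - 1)       # from emin <= q + p - 1
q_max = emax - (p - 1)       # from q + p - 1 <= emax

smallest_positive = finite_value(0, 1, q_min, b)   # smallest non-zero positive number
largest = finite_value(0, c_max, q_max, b)         # largest finite number
smallest_normal = Fraction(b) ** emin              # b^(1 - emax)

print(emin, c_max, q_min, q_max)
print(smallest_positive < smallest_normal < largest)
```

Every positive value below `smallest_normal` that this sketch can produce (significand between 1 and `c_max`, exponent `q_min`) corresponds to a subnormal number in the sense described above.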
=== Representation and encoding in memory ===

Some numbers may have several possible floating-point representations. For instance, if b = 10 and p = 7, then −12.345 can be represented by $-12345 \times 10^{-3}$, $-123450 \times 10^{-4}$, and $-1234500 \times 10^{-5}$. However, for most operations, such as arithmetic operations, the result (value) does not depend on the representation of the inputs.

For the decimal formats, any representation is valid, and the set of these representations is called a cohort. When a result can have several representations, the standard specifies which member of the cohort is chosen.

For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range (the exponent field being neither all ones nor all zeros), the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called the leading bit convention, implicit bit convention, or hidden bit convention. This rule allows the binary format to have an extra bit of precision. The leading bit convention cannot be used for the subnormal numbers, as they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.

Due to the possibility of multiple encodings (at least in formats called interchange formats), a NaN may carry other information: a sign bit (which has no meaning, but may be used by some operations) and a payload, which is intended for diagnostic information indicating the source of the NaN (but the payload may have other uses, such as NaN-boxing).
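
The bias and the implicit leading bit can be made concrete by taking a binary32 encoding apart. The sketch below is illustrative only: the function name `decode_binary32` is hypothetical, and the field widths (8 exponent bits, 23 stored significand bits) and the bias of 127 are the well-known binary32 parameters rather than values stated in this section.

```python
import struct


def decode_binary32(x: float):
    """Split the binary32 encoding of x into fields and rebuild the value.

    Returns (sign, biased_exponent, trailing_significand, kind, value);
    value is None for infinities and NaNs. The sign of a zero is reported
    only in the sign field, not in the rebuilt value.
    """
    # Round x to binary32 and reinterpret its 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF   # 8-bit exponent field
    trailing = bits & 0x7FFFFF         # 23 explicitly stored significand bits
    bias = 127

    if biased_exp == 0xFF:
        # All ones: infinity or NaN, depending on the trailing significand.
        kind = "infinity" if trailing == 0 else "nan"
        value = None
    elif biased_exp == 0:
        # All zeros: zero or subnormal. No implicit bit; the scale is that
        # of the smallest normal exponent (-126).
        kind = "zero" if trailing == 0 else "subnormal"
        value = (-1) ** sign * trailing * 2.0 ** (-126 - 23)
    else:
        # Normal range: restore the implicit leading 1 before scaling.
        kind = "normal"
        significand = (1 << 23) | trailing
        value = (-1) ** sign * significand * 2.0 ** (biased_exp - bias - 23)
    return sign, biased_exp, trailing, kind, value


print(decode_binary32(1.0))
print(decode_binary32(-0.15625))
print(decode_binary32(2.0 ** -149))
```

For 1.0 the stored significand field is zero and the whole value comes from the implied leading 1; for the subnormal input the exponent field is zero and no leading 1 is added.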
=== Basic and interchange formats ===

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985 respectively. A conforming implementation must fully implement at least one of the basic formats.

The standard also defines interchange formats, which generalize these basic formats. For the binary formats, the leading bit convention is required. The following table summarizes some of the possible interchange formats (including the basic formats).

In the table above, integer values are exact, whereas values in decimal notation (e.g. 1.0) are rounded values. The minimum exponents listed are for normal numbers; the special
subnormal number representation allows even smaller (in magnitude) numbers to be represented with some loss of precision. For example, the smallest positive number that can be represented in binary64 is $2^{-1074}$; contributions to the −1074 figure include the emin value −1022 and all but one of the 53 significand bits ($2^{-1022 - (53 - 1)} = 2^{-1074}$).

Decimal digits is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as $\text{digits} \times \log_{10}(\text{base})$. E.g. binary128 has approximately the same precision as a 34-digit decimal number.

$\log_{10}(\text{MAXVAL})$ is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point (e.g. $1.698 \times 10^{38}$ is near the largest value in binary32, $9.999999 \times 10^{96}$ is the largest value in decimal32).

The binary32 (single) and binary64 (double) formats are two of the most common formats used today. The figure below shows the absolute precision for both formats over a range of values. This figure can be used to select an appropriate format given the expected value of a number and the required precision.

An example of a layout for 32-bit floating point is shown in the following figure; the 64-bit layout is similar.
[Figure: layout of a 32-bit floating-point number]
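
The two derived quantities discussed above, the equivalent number of decimal digits and the smallest positive subnormal number, can be recomputed directly. This is a small sketch under stated assumptions: the helper name `decimal_digits` is made up for illustration, the precisions 24 and 113 for binary32 and binary128 are the standard values rather than figures quoted in this section, and Python's `float` is assumed to be binary64, as it is on mainstream platforms.

```python
import math
import sys


def decimal_digits(p: int, base: int = 2) -> float:
    """Precision as an equivalent number of decimal digits: p * log10(base)."""
    return p * math.log10(base)


# Precision p in bits, including the implicit leading bit (assumed values).
for name, bits in [("binary32", 24), ("binary64", 53), ("binary128", 113)]:
    print(f"{name}: about {decimal_digits(bits):.2f} decimal digits")

# binary64: the smallest positive subnormal is 2^(emin - (p - 1)).
emin_b64, p_b64 = -1022, 53
smallest_subnormal = math.ldexp(1.0, emin_b64 - (p_b64 - 1))   # 2^-1074
smallest_normal = math.ldexp(1.0, emin_b64)                    # 2^-1022

print(smallest_subnormal, smallest_normal)
print(smallest_normal == sys.float_info.min)
```

Halving `smallest_subnormal` gives zero under the default rounding, which is one way to see that nothing smaller in magnitude is representable in the format.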
=== Extended and extendable precision formats ===

The standard specifies optional extended and extendable precision formats, which provide greater precision than the basic formats. An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.

The standard recommends that language standards provide a method of specifying p and emax for each supported base b. The standard recommends that language standards and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b. For an extended format with a precision between two basic formats, the exponent range must be as great as that of the next wider basic format. So, for instance, a 64-bit extended precision binary number must have an emax of at least 16383. The x87 80-bit extended format meets this requirement.

The original IEEE 754-1985 standard also had the concept of extended formats, but without any mandatory relation between emin and emax. For example, the Motorola 68881 80-bit format, where emin = −emax, was a conforming extended format, but it became non-conforming in the 2008 revision.
=== Interchange formats ===

Interchange formats are intended for the exchange of floating-point data using a bit string of fixed length for a given format.
==== Binary ====

For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥ 128 are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).

The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by w exponent bits that describe the exponent offset by a bias, and p − 1 bits that describe the significand. The width of the exponent field for a k-bit format is computed as $w = \operatorname{round}(4 \log_2 k) - 13$. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8 respectively) than this formula would provide (3 and 7 respectively).

As with IEEE 754-1985, the biased-exponent field is filled with all 1 bits to indicate either infinity (trailing significand field = 0) or a NaN (trailing significand field ≠ 0). For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively, and the payload is carried in the remaining bits.
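
The width rule and the special-value encoding can be sketched in a few lines. The function names below are hypothetical. Two facts used here come from the standard rather than from the text above and should be read as assumptions of the sketch: emax = 2^(w − 1) − 1 with the bias equal to emax, and a quiet NaN having the most significant bit of the trailing significand set to 1.

```python
import math


def binary_interchange_parameters(k: int) -> dict:
    """Field widths and exponent range of a k-bit binary interchange format."""
    if k == 16:
        w = 5                      # exception noted in the text
    elif k == 32:
        w = 8                      # exception noted in the text
    elif k == 64 or (k >= 128 and k % 32 == 0):
        w = round(4 * math.log2(k)) - 13
    else:
        raise ValueError(f"no binary interchange format of width {k}")
    p = k - w                      # 1 sign bit + w exponent bits + (p - 1) bits = k
    emax = 2 ** (w - 1) - 1        # assumption: taken from the standard
    return {"k": k, "w": w, "p": p, "emax": emax, "emin": 1 - emax, "bias": emax}


def classify_special(trailing: int, p: int) -> str:
    """Classify an encoding whose biased-exponent field is all ones."""
    if trailing == 0:
        return "infinity"
    quiet_bit = 1 << (p - 2)       # most significant of the p - 1 trailing bits
    # Assumption: a set bit marks a quiet NaN, a clear bit a signaling NaN.
    return "quiet NaN" if trailing & quiet_bit else "signaling NaN"


for k in (16, 32, 64, 128, 256):
    print(binary_interchange_parameters(k))
print(classify_special(0x400000, 24), "/", classify_special(0x000001, 24))
```

For k = 64 and k = 128 the sketch gives w = 11 and w = 15, consistent with the statement that these formats follow the formula.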
==== Decimal ====

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined. As with binary interchange, the encoding scheme for the decimal interchange formats encodes the sign, exponent, and significand. Two different bit-level encodings are defined, and interchange is complicated by the fact that some external indicator of the encoding in use may be required.

The two options allow the significand to be encoded as a compressed sequence of decimal digits using densely packed decimal or, alternatively, as a binary integer. The former is more convenient for direct hardware implementation of the standard, while the latter is more suited to software emulation on a binary computer. In either case, the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and special values (±zero with the minimum exponent, ±infinity, quiet NaNs, and signaling NaNs) have identical encodings.

== Rounding rules ==