== Floating-point numbers ==
A number representation specifies some way of encoding a number, usually as a string of digits. There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the
radix point is indicated by placing an explicit
"point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an
integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In
fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.
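As a minimal sketch of such a convention (the helper name and the middle placement of the point are just the example above, not any standard interface), decoding reduces to integer parsing plus a fixed scale factor:

```python
def decode_fixed_point(digits: str) -> float:
    """Interpret an 8-digit string with an implied decimal point in the middle."""
    assert len(digits) == 8
    return int(digits) / 10**4  # four digits lie to the right of the implied point

print(decode_fixed_point("00012345"))  # 1.2345
```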
In scientific notation, the given number is scaled by a
power of 10, so that it lies within a specific range—typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of
Jupiter's moon
Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 × 10^5 seconds. Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of: • A signed (meaning positive or negative) digit string of a given length in a given
radix (or base). This digit string is referred to as the
significand,
mantissa, or
coefficient. The length of the significand determines the
precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit. • A signed integer
exponent (also referred to as the
characteristic, or
scale), which modifies the magnitude of the number. To derive the value of the floating-point number, the
significand is multiplied by the
base raised to the power of the
exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative. Using base-10 (the familiar
decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10^5 to give 1.528535047 × 10^5, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred. Symbolically, this final value is: \frac{s}{b^{\,p-1}} \times b^e, where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base (in our example, this is the number ten), and e is the exponent.
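The decoding rule can be checked directly; the following sketch (our own helper, not part of any library) applies the formula to the decimal example above:

```python
def decode(significand: int, exponent: int, base: int = 10, precision: int = 10) -> float:
    # value = significand / base**(precision - 1) * base**exponent,
    # i.e. place the radix point just after the first digit of the
    # significand, then shift it by the exponent.
    return significand / base**(precision - 1) * base**exponent

print(decode(1_528_535_047, 5))  # 152853.5047 (up to binary rounding in the floats)
```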
Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (
decimal floating point), and other less common varieties, such as base sixteen (
hexadecimal floating point), base eight (octal floating point), base four (quaternary floating point), base three (
balanced ternary floating point) and even base 256 and base 65,536. A floating-point number is a
rational number, because it can be represented as one integer divided by another; for example 1.45 × 10^3 is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2 × 10^−1). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1, or 1 × 3^−1). The occasions on which infinite expansions occur depend on the base and its prime factors.
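These representability claims are easy to observe in Python, whose float type is binary and whose decimal module implements a base-10 format; Fraction is used below only to expose the exact value a binary float stores:

```python
from decimal import Decimal
from fractions import Fraction

# 1/5 has no finite binary expansion, so the binary float 0.2 is only an
# approximation; Fraction reveals the exact stored value.
print(Fraction(0.2))    # 3602879701896397/18014398509481984 (not 1/5)

# A decimal (base-10) format represents 1/5 exactly.
print(Decimal("0.2"))   # 0.2

# 1/3 has no finite expansion in base 2 or base 10; only a ratio of
# integers represents it exactly.
print(Fraction(1, 3))   # 1/3
```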
The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, p = 24, and so the significand is a string of 24
bits. For instance, the number
π's first 33 bits are: 11001001\ 00001111\ 1101101\underline{0}\ 10100010\ 0. In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, shown as the underlined bit above. The next bit, at position 24, is called the
round bit or
rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are
specific rules for halfway values, which is not the case here). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding: 11001001\ 00001111\ 1101101\underline{1}. When this is stored in memory using the IEEE 754 encoding, this becomes the
significand s. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:
\begin{align} &\biggl(\sum_{n=0}^{p-1} \text{bit}_n \times 2^{-n}\biggr) \times 2^e \\ &\qquad{}= \left(1 \times 2^{-0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + \cdots + 1 \times 2^{-23}\right) \times 2^1 \\[2mu] &\qquad{} \approx 1.57079637 \times 2 \\[3mu] &\qquad{} \approx 3.1415927 \end{align}
where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).
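The summation can be reproduced in a few lines; this sketch uses the rounded 24-bit significand from the text (the variable names are ours):

```python
# Rounded 24-bit significand of pi, as derived above.
bits = "110010010000111111011011"
e = 1  # exponent

# value = (sum over n of bit_n * 2**-n) * 2**e
value = sum(int(b) * 2**-n for n, b in enumerate(bits)) * 2**e
print(value)  # 3.1415927410125732
```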
It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the
leading bit convention, the
implicit bit convention, the
hidden bit convention, or the
assumed bit convention.
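The hidden bit can be observed by unpacking a raw IEEE 754 binary32 encoding; this sketch uses Python's struct module (the masks follow the standard 1-8-23 bit layout of binary32):

```python
import math
import struct

# Pack pi into IEEE 754 binary32 and recover the raw 32-bit pattern.
(raw,) = struct.unpack(">I", struct.pack(">f", math.pi))
sign = raw >> 31                      # 1 sign bit
biased_exponent = (raw >> 23) & 0xFF  # 8 exponent bits
stored_fraction = raw & 0x7FFFFF      # only 23 significand bits are stored

# The leading 1 of the normalized significand is implicit (hidden),
# so it must be restored before computing the value.
significand = (1 << 23) | stored_fraction
value = (-1) ** sign * significand * 2.0 ** (biased_exponent - 127 - 23)
print(f"{stored_fraction:023b}")  # 10010010000111111011011 (no leading 1)
print(value)                      # 3.1415927410125732
```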
== Alternatives to floating-point numbers ==
The floating-point representation is by far the most common way of representing an approximation to real numbers in computers. However, there are alternatives: •
Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications (a brief sketch appears after this list). •
Logarithmic number systems (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (
i.e., the graph of the logarithm function) is smooth (except at 0). In contrast to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (
symmetric)
level-index arithmetic (LI and SLI) of Charles Clenshaw,
Frank Olver and Peter Turner is a scheme based on a
generalized logarithm representation. •
Tapered floating-point representation, used in
Unum formats, including
Posit. • Some simple rational numbers (
e.g., 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (
e.g., 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform
rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly (see the sketch after this list). Such packages generally need to use "
bignum" arithmetic for the individual integers. •
Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point. •
Computer algebra systems such as
Mathematica,
Maxima, and
Maple can often handle irrational numbers like \pi or \sqrt{3} in a completely "formal" way (symbolic computation), without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "\sin (3\pi)" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
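As referenced in the fixed-point and rational-arithmetic items above, here is a brief sketch of both; the cent-based scale factor is our own illustrative choice:

```python
from fractions import Fraction

# Fixed point: store currency as an integer count of cents (scale factor 100),
# so ordinary integer hardware performs the arithmetic exactly.
a_cents, b_cents = 1999, 550   # $19.99 and $5.50
total_cents = a_cents + b_cents
print(total_cents / 100)       # 25.49

# Rational arithmetic: any rational number is represented exactly.
x = Fraction(1, 3) + Fraction(1, 10)
print(x)                       # 13/30
```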
== History ==