Even though some numbers can be represented exactly by floating-point numbers (such numbers are called machine numbers), performing floating-point arithmetic may still lead to roundoff error in the final result.
'''Addition'''

Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision, but the result must be rounded back to the specified precision, which may lead to roundoff error. Note that the addition of two floating-point numbers can produce roundoff error when their sum is an order of magnitude greater than the larger of the two.

• For example, consider a normalized floating-point number system with base 10 and precision 2. Then fl(62)=6.2 \times 10^{1} and fl(41) = 4.1 \times 10^{1}. Note that 62+41=103 but fl(103)=1.0 \times 10^{2}, so the roundoff error is 103-fl(103)=3. This kind of error can occur alongside an absorption error in a single operation.
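This example can be reproduced with a minimal sketch in Python, using the decimal module to emulate a base-10, precision-2 system; the context name ctx and the helper fl are illustrative, not part of the example above.

```python
from decimal import Decimal, Context

ctx = Context(prec=2)  # base 10, 2 significand digits

def fl(x):
    """Round x into the 2-digit system (round-to-nearest by default)."""
    return ctx.plus(Decimal(x))  # unary plus applies the context rounding

a, b = fl(62), fl(41)       # 6.2E+1 and 4.1E+1, both exact
s = ctx.add(a, b)           # fl(62 + 41) = Decimal('1.0E+2')
print(s, Decimal(103) - s)  # 1.0E+2 3  -> roundoff error of 3
```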
'''Multiplication'''

In general, the product of two p-digit significands contains up to 2p digits, so the result might not fit in the significand and must be rounded, introducing roundoff error.

• For example, consider a normalized floating-point number system with base \beta=10 and at most 2 significand digits. Then fl(77) = 7.7 \times 10^{1} and fl(88) = 8.8 \times 10^{1}. Note that 77 \times 88=6776 but fl(6776) = 6.7 \times 10^{3}, since there are at most 2 significand digits. The roundoff error is 6776 - fl(6776) = 6776 - 6.7 \times 10^{3}=76.
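A sketch of this example follows; note that fl(6776) = 6.7 \times 10^{3} corresponds to chopping (truncation) rather than round-to-nearest, so the decimal context below assumes ROUND_DOWN.

```python
from decimal import Decimal, Context, ROUND_DOWN

# ROUND_DOWN chops (truncates), matching fl(6776) = 6.7E+3 above.
ctx = Context(prec=2, rounding=ROUND_DOWN)

p = ctx.multiply(Decimal(77), Decimal(88))  # Decimal('6.7E+3')
exact = Decimal(77) * Decimal(88)           # Decimal('6776'), exact
print(p, exact - p)                         # 6.7E+3 76
```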
'''Division'''

In general, the quotient of two p-digit significands may contain more than p digits, so the result must be rounded, introducing roundoff error.

• For example, if the normalized floating-point number system above is still being used, then 1/3=0.333\ldots but fl(1/3)=fl(0.333\ldots)=3.3 \times 10^{-1}. So the tail 0.333\ldots - 3.3 \times 10^{-1}=0.00333\ldots is cut off.
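The same truncated quotient can be sketched with the 2-digit decimal context used above:

```python
from decimal import Decimal, Context

ctx = Context(prec=2)  # same base-10, 2-digit system

q = ctx.divide(Decimal(1), Decimal(3))  # Decimal('0.33') = 3.3E-1
print(q)                                # the tail 0.00333... is lost
```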
'''Subtraction'''

Absorption also applies to subtraction.

• For example, subtracting 2^{-60} from 1 in IEEE double precision proceeds as follows:
\begin{align} 1.00\ldots 0 \times 2^{0} - 1.00\ldots 0 \times 2^{-60} &= \underbrace{1.00\ldots 0}_\text{60 bits} \times 2^{0} - \underbrace{0.00\ldots 01}_\text{60 bits} \times 2^{0}\\ &= \underbrace{0.11\ldots 1}_\text{60 bits}\times 2^{0}. \end{align}
This is saved as \underbrace{1.00\ldots 0}_\text{53 bits}\times 2^{0} since round-to-nearest is used in the IEEE standard. Therefore, 1-2^{-60} is equal to 1 in IEEE double precision, and the roundoff error is -2^{-60}.
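This absorption can be observed directly in Python, whose float is an IEEE double:

```python
# 2**-60 is far below the spacing of doubles just below 1 (2**-53),
# so round-to-nearest returns exactly 1.0.
x = 1.0 - 2.0 ** -60
print(x == 1.0)  # True: 1 - 2**-60 equals 1 in double precision
print(x - 1.0)   # 0.0, so the roundoff error is -2**-60
```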
The subtraction of two nearly equal numbers is called subtractive cancellation. When the leading digits are cancelled, the result may be too small to be represented exactly and will simply be represented as 0.

• For example, let |\epsilon| < \epsilon_\text{mach}, where the second definition of machine epsilon is used here. What is the solution to (1+\epsilon) - (1-\epsilon)? It is known that 1+\epsilon and 1-\epsilon are nearly equal numbers, and (1+\epsilon) - (1-\epsilon)=1+\epsilon-1+\epsilon=2\epsilon. However, in the floating-point number system, fl((1+\epsilon) - (1-\epsilon))=fl(1+\epsilon)-fl(1-\epsilon)=1-1=0. Although 2\epsilon is easily big enough to be represented, both instances of \epsilon have been rounded away, giving 0.

Even with a somewhat larger \epsilon, the result is still typically unreliable: there is not much faith in the accuracy of the value, because the greatest uncertainty in any floating-point number is in the digits on the far right.

• For example, 1.99999 \times 10^{2} - 1.99998 \times 10^{2} = 0.00001\times10^{2} = 1 \times 10^{-5}\times 10^{2}=1\times10^{-3}. The result 1\times10^{-3} is clearly representable, but there is not much faith in it.
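A sketch of the first cancellation example in IEEE double precision; \epsilon is chosen below 2^{-53} (the second definition of machine epsilon for doubles) so that both 1+\epsilon and 1-\epsilon round to exactly 1.0.

```python
eps = 2.0 ** -55  # |eps| < 2**-53, the double-precision machine epsilon
a = 1.0 + eps     # fl(1 + eps) == 1.0: eps is rounded away
b = 1.0 - eps     # fl(1 - eps) == 1.0: eps is rounded away
print(a - b)      # 0.0, even though 2*eps is representable
print(2 * eps)    # 2**-54 ~ 5.55e-17, nonzero and easily representable
```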
This is closely related to the phenomenon of catastrophic cancellation, in which the two numbers are known to be approximations.

== Accumulation of roundoff error ==