==Dithering and error diffusion==
When digitizing
continuous signals, such as sound waves, the overall effect of a number of measurements is more important than the accuracy of each individual measurement. In these circumstances,
dithering, and a related technique,
error diffusion, are normally used. A related technique called
pulse-width modulation is used to achieve analog-type output from an inertial device by rapidly pulsing the power with a variable duty cycle.
Delta-sigma modulation is commonly used for converting between real-world signals and digital signals, and it allows the frequency distribution of the quantization noise to be controlled. Error diffusion tries to ensure that the error is minimized on average. When dealing with a gentle slope from zero to one, the output is zero for the first few terms until the running sum of the error and the current value exceeds 0.5; a 1 is then output and the difference is subtracted from the accumulated error.
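A minimal sketch of this one-dimensional error diffusion (the function and variable names are illustrative, not from any particular library), quantizing a gentle ramp to 0s and 1s while carrying the rounding error forward:
<syntaxhighlight lang="python">
def error_diffuse(values):
    """Quantize each value to 0 or 1, diffusing the rounding error forward."""
    output = []
    error = 0.0  # accumulated error carried to the next sample
    for v in values:
        if v + error > 0.5:
            output.append(1)
            error += v - 1  # a 1 was output, so the leftover error is v - 1
        else:
            output.append(0)
            error += v      # a 0 was output, so the whole value is carried forward
    return output

# A gentle ramp from 0 to 1: the first few outputs are 0, and 1s then appear often
# enough that the running sum of outputs tracks the running sum of inputs.
ramp = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
print(error_diffuse(ramp))
</syntaxhighlight>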
Floyd–Steinberg dithering is a popular error diffusion procedure when digitizing images. As a one-dimensional example, suppose the numbers 0.9677, 0.9204, 0.7451, and 0.3091 occur in order and each is to be rounded to a multiple of 0.01. In this case the cumulative sums, 0.9677, 1.8881, 2.6332, and 2.9423, are each rounded to a multiple of 0.01: 0.97, 1.89, 2.63, and 2.94. The first of these and the differences of adjacent values give the desired rounded values: 0.97, 0.92, 0.74, and 0.31.
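A short sketch (the names are illustrative, not from any standard library) that reproduces the one-dimensional example above using Python's decimal module:
<syntaxhighlight lang="python">
from decimal import Decimal

def diffuse_round(values, step):
    """Round each value to a multiple of `step` so that cumulative sums stay accurate:
    round the running total, then take differences of adjacent rounded totals."""
    rounded_totals, total = [], Decimal(0)
    for v in values:
        total += Decimal(v)
        rounded_totals.append((total / step).to_integral_value() * step)
    return [b - a for a, b in zip([Decimal(0)] + rounded_totals, rounded_totals)]

print(diffuse_round(["0.9677", "0.9204", "0.7451", "0.3091"], Decimal("0.01")))
# [Decimal('0.97'), Decimal('0.92'), Decimal('0.74'), Decimal('0.31')]
</syntaxhighlight>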
==Monte Carlo arithmetic==
Monte Carlo arithmetic is a technique in
Monte Carlo methods where the rounding is randomly up or down. Stochastic rounding can be used for Monte Carlo arithmetic, but rounding up or down with equal probability is more commonly used. Repeated runs will give a random distribution of results, which can indicate how stable the computation is.
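A minimal illustration of the idea (the helper names are illustrative assumptions): each value is rounded up or down at random, and repeating a summation many times shows how much the result scatters.
<syntaxhighlight lang="python">
import math
import random
import statistics

def random_round(x: float) -> int:
    """Round x up or down with equal probability (Monte Carlo arithmetic).
    Stochastic rounding would instead choose up/down with probability
    proportional to how close x is to each neighbour."""
    lower = math.floor(x)
    if x == lower:
        return lower
    return lower if random.random() < 0.5 else lower + 1

data = [2.3, 4.7, 5.5, 1.2, 3.9]          # the exact sum is 17.6
runs = [sum(random_round(v) for v in data) for _ in range(10_000)]
# The spread of repeated runs hints at how sensitive the computation is to rounding.
print(statistics.mean(runs), statistics.stdev(runs))
</syntaxhighlight>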
==Exact computation with rounded arithmetic==
It is possible to use rounded arithmetic to evaluate the exact value of a function with integer domain and range. For example, if an integer n is known to be a perfect square, its square root can be computed by converting n to a floating-point value z, computing the approximate square root x of z with floating point, and then rounding x to the nearest integer y. If n is not too big, the floating-point round-off error in x will be less than 0.5, so the rounded value y will be the exact square root of n. This is essentially why
slide rules could be used for exact arithmetic.
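A sketch of the square-root case in Python (the function name and the stated bound are assumptions chosen as a comfortable safety margin, not a tight one):
<syntaxhighlight lang="python">
import math

def exact_isqrt_of_square(n: int) -> int:
    """Return the exact square root of a perfect square n using floating point.
    For n small enough that the rounding error of math.sqrt stays below 0.5
    (comfortably true for n < 2**52), rounding to the nearest integer is exact."""
    z = float(n)              # convert to floating point
    x = math.sqrt(z)          # approximate square root
    y = round(x)              # round to the nearest integer
    assert y * y == n, "n was not a perfect square (or too large for this trick)"
    return y

print(exact_isqrt_of_square(12345678 ** 2))   # 12345678
</syntaxhighlight>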
==Double rounding==
Rounding a number twice in succession to different levels of precision, with the latter precision being coarser, is not guaranteed to give the same result as rounding once to the final precision, except in the case of directed rounding. For instance, 9.46 rounded to one decimal place is 9.5, which in turn rounds to 10 when rounded to an integer using round half to even, whereas 9.46 rounded directly to an integer gives 9. Double rounding also has implications when comparing data rounded to one decimal place to specification limits expressed using integers. In
Martinez v. Allstate and
Sendejo v. Farmers, litigated between 1995 and 1997, the insurance companies argued that double rounding of premiums was permissible and in fact required. The US courts ruled against the insurance companies and ordered them to adopt rules to ensure single rounding. Some computer languages and the
IEEE 754-2008 standard dictate that in straightforward calculations the result should not be rounded twice. This has been a particular problem with Java, as it is designed to be run identically on different machines; special programming tricks have had to be used to achieve this with
x87 floating point. The Java language was changed to allow different results where the difference does not matter and to require a
strictfp qualifier to be used when the results have to conform exactly; strict floating point was restored in Java 17. In some algorithms, an intermediate result is computed in a larger precision and must then be rounded to the final precision. Double rounding can be avoided by choosing an adequate rounding for the intermediate computation: the intermediate result must never fall on a midpoint of the final precision (except when the midpoint is exact). In binary arithmetic, the idea is to round the result toward zero and to set the least significant bit to 1 if the rounded result is inexact; this rounding is called
sticky rounding. Equivalently, it consists in returning the intermediate result when it is exactly representable, and the nearest floating-point number with an odd significand otherwise; this is why it is also known as
rounding to odd. For both binary and decimal arithmetic, this approach is implemented as
Rounding to prepare for shorter precision.
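A small illustrative sketch (the helper names are not from any library; Fraction merely stands in for an exact, wider intermediate format), showing a case where naive double rounding goes wrong and a round-to-odd intermediate step repairs it:
<syntaxhighlight lang="python">
from fractions import Fraction

def round_half_even(x: Fraction, step: Fraction) -> Fraction:
    """Round x to a multiple of step, ties to even (assumes x >= 0 for brevity)."""
    q, r = divmod(x, step)
    if r * 2 > step or (r * 2 == step and q % 2 == 1):
        q += 1
    return q * step

def round_to_odd(x: Fraction, step: Fraction) -> Fraction:
    """Sticky rounding / round to odd: truncate toward zero and force the last
    digit to be odd when the result is inexact (assumes x >= 0 for brevity)."""
    q, r = divmod(x, step)
    if r != 0 and q % 2 == 0:
        q += 1
    return q * step

x = Fraction("2.51")
quarter, one = Fraction(1, 4), Fraction(1)

print(round_half_even(x, one))                                  # direct: 2.51 -> 3
print(round_half_even(round_half_even(x, quarter), one))        # naive: 2.51 -> 2.5 -> 2 (wrong)
print(round_half_even(round_to_odd(x, quarter), one))           # round to odd: 2.51 -> 2.75 -> 3
</syntaxhighlight>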
==Rounding to prepare for shorter precision==
This rounding mode is used to avoid getting a potentially wrong result after
multiple roundings. This can be achieved if all roundings except the final one are done using rounding to prepare for shorter precision ("RPSP"), and only the final rounding uses the externally requested mode. With decimal arithmetic, final digits of 0 and 5 are avoided when the input is not exactly representable; if there is a choice between numbers whose least significant digit is 0 or 1, 4 or 5, 5 or 6, or 9 or 0, the digit different from 0 or 5 is selected; otherwise, the choice is arbitrary. IBM defines that, in the latter case, the digit of smaller magnitude shall be selected. RPSP can be applied with a step between two consecutive roundings as small as a single digit (for example, rounding to 1/10 can be applied after rounding to 1/100). For example, when rounding to integer,
• 20.0 is rounded to 20;
• 20.01, 20.1, 20.9, 20.99, 21, 21.01, 21.9, 21.99 are rounded to 21 (avoiding a final 0);
• 22.0, 22.1, 22.9, 22.99 are rounded to 22;
• 24.0, 24.1, 24.9, 24.99 are rounded to 24 (avoiding a final 5);
• 25.0 is rounded to 25;
• 25.01, 25.1 are rounded to 26 (avoiding a final 5).
In the example from the "
Double rounding" section, rounding 9.46 to one decimal gives 9.4, which rounding to integer in turn gives 9. With binary arithmetic, this rounding is also called "round to odd" (not to be confused with "
round half to odd"). For example, when rounding to 1/4 (0.01 in binary), • ⇒ result is 2 (10.00 in binary) • ⇒ result is 2.25 (10.01 in binary) • ⇒ result is 2.5 (10.10 in binary) • ⇒ result is 2.75 (10.11 in binary) • ⇒ result is 3 (11.00 in binary) For correct results with binary arithmetic, each rounding step must remove at least 2 binary digits, otherwise, wrong results may appear. For example, • 3.125 RPSP to 1/4 ⇒ result is 3.25 • 3.25 RPSP to 1/2 ⇒ result is 3.5 • 3.5 round-half-to-even to 1 ⇒ result is 4 (wrong) If the erroneous middle step is removed, the final rounding to integer rounds 3.25 to the correct value of 3. RPSP is implemented in hardware in IBM
zSeries and
pSeries. In
Python module "Decimal",
Tcl module "math",
Haskell package "decimal-arithmetic", and possibly others, this mode is called ROUND_05UP or round05up.
==Table-maker's dilemma==
William M. Kahan coined the term "The Table-Maker's Dilemma" for the unknown cost of correctly rounding
transcendental functions. The
IEEE 754 floating-point standard guarantees that add, subtract, multiply, divide,
fused multiply–add, square root, and floating-point remainder will give the correctly rounded result of the infinite-precision operation. No such guarantee was given in the 1985 standard for more complex functions, which are typically accurate only to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results that respect the active rounding mode, although implementing these functions is optional. Using the
Gelfond–Schneider theorem and
Lindemann–Weierstrass theorem, many of the standard elementary functions can be proved to return
transcendental results, except on some well-known arguments; therefore, from a theoretical point of view, it is always possible to round such functions correctly. However, for an implementation of such a function, determining, for a given precision, how accurately intermediate results must be computed before a correctly rounded result can be guaranteed may demand a lot of computation time or may be out of reach. In practice, when this limit is not known (or only a very large bound is known), some decision has to be made in the implementation (see below); but according to a probabilistic model, correct rounding can be achieved with very high probability when using an intermediate accuracy of up to twice the number of digits of the target format plus some small constant (after taking special cases into account). Some programming packages offer correct rounding. The
GNU MPFR package gives correctly rounded arbitrary precision results. Some other libraries implement elementary functions with correct rounding in
IEEE 754 double precision (binary64):
• IBM's ml4j, which stands for Mathematical Library for Java, written by Abraham Ziv and Moshe Olshansky in 1999, with correct rounding to nearest only. This library was claimed to be portable, but only binaries for PowerPC/AIX, SPARC/Solaris and x86/Windows NT were provided. According to its documentation, it uses a first step with an accuracy a bit larger than double precision, a second step based on double-double arithmetic, and a third step with a 768-bit precision based on arrays of IEEE 754 double-precision floating-point numbers.
• IBM's Accurate portable mathematical library (abbreviated as APMathLib or just MathLib), also called libultim, with rounding to nearest only. This library uses up to 768 bits of working precision. It was included in the GNU C Library in 2001, but the "slow paths" (providing correct rounding) were removed from 2018 to 2021.
• CRlibm, written by the former Arénaire team (LIP, ENS Lyon) and first distributed in 2003. It supports the 4 rounding modes, is proved correct using knowledge of the hardest-to-round cases, and is more efficient than IBM's MathLib.
• Sun Microsystems's libmcr of 2004, in the 4 rounding modes. For the difficult cases, this library also uses multiple precision, and the number of words is increased by 2 each time the Table-maker's dilemma occurs (with undefined behavior in the very unlikely event that some limit of the machine is reached).
• The CORE-MATH project (2022), which provides some correctly rounded functions in the 4 rounding modes for x86-64 processors, proved correct using knowledge of the hardest-to-round cases.
• LLVM libc, which provides some correctly rounded functions in the 4 rounding modes.
There exist
computable numbers for which a rounded value can never be determined no matter how many digits are calculated. Specific instances cannot be given, but the existence of such numbers follows from the undecidability of the
halting problem. For instance, if
Goldbach's conjecture is true but
unprovable, then the result of rounding the following value, ,
up to the next integer cannot be determined: either y = 1 + 10^(−n), where n is the first even number greater than 4 which is not the sum of two primes, or y = 1 if there is no such number. The rounded result is 2 if such a number exists and 1 otherwise. The value before rounding can, however, be approximated to any given precision even if the conjecture is unprovable.
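A small sketch of that last point (the helper names and the search bound are illustrative assumptions): the value y can be approximated to any requested number of decimal places by searching for a counterexample only up to that many digits, yet no finite search of this kind can decide whether rounding y up should give 1 or 2.
<syntaxhighlight lang="python">
from fractions import Fraction

def is_prime(k: int) -> bool:
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def is_sum_of_two_primes(n: int) -> bool:
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

def approximate_y(places: int) -> Fraction:
    """Approximate y to `places` decimal places.  Only even numbers up to `places`
    need to be checked: a counterexample n > places contributes 10**(-n), which
    no longer affects the first `places` decimal places."""
    for n in range(6, places + 1, 2):
        if not is_sum_of_two_primes(n):
            return 1 + Fraction(1, 10 ** n)
    return Fraction(1)   # any remaining uncertainty is below 10**(-places)

print(approximate_y(50))   # 1: no counterexample up to 50, so y = 1 within 10**(-50)
# Deciding whether y rounds up to 1 or to 2, however, would require knowing whether
# a counterexample exists at all, i.e. settling Goldbach's conjecture.
</syntaxhighlight>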
==Interaction with string searches==
Rounding can adversely affect a string search for a number. For example, π rounded to four digits is "3.1416", but a simple search for this string will not discover "3.14159" or any other value of π rounded to more than four digits. In contrast, truncation does not suffer from this problem; for example, a simple string search for "3.1415", which is π truncated to four digits, will discover values of π truncated to more than four digits.

==History==