=== Newton–Raphson division ===

Newton–Raphson division uses Newton's method to find the reciprocal of D and multiplies that reciprocal by N to find the final quotient Q.

The steps of Newton–Raphson division are:
• Calculate an estimate X_0 for the reciprocal 1/D of the divisor D.
• Compute successively more accurate estimates X_1,X_2,\ldots,X_S of the reciprocal. This is where one employs the Newton–Raphson method as such.
• Compute the quotient by multiplying the dividend by the reciprocal of the divisor: Q = N X_S.

In order to apply Newton's method to find the reciprocal of D, it is necessary to find a function f(X) that has a zero at X=1/D. The obvious such function is f(X)=DX-1, but the Newton–Raphson iteration for this is unhelpful, since it cannot be computed without already knowing the reciprocal of D (moreover it attempts to compute the exact reciprocal in one step, rather than allowing for iterative improvements). A function that does work is f(X)=(1/X)-D, for which the Newton–Raphson iteration gives

:X_{i+1} = X_i - {f(X_i)\over f'(X_i)} = X_i - {1/X_i - D\over -1/X_i^2} = X_i + X_i(1-DX_i) = X_i(2-DX_i),

which can be calculated from X_i using only multiplication and subtraction, or using two fused multiply–adds.

From a computational point of view, the expressions X_{i+1} = X_i + X_i(1-DX_i) and X_{i+1} = X_i(2-DX_i) are not equivalent. To obtain a result with a precision of 2n bits while using the second expression, one must compute the product between X_i and (2-DX_i) with double the given precision of X_i (n bits). In contrast, the product between X_i and (1-DX_i) need only be computed with a precision of n bits, because the leading n bits (after the binary point) of (1-DX_i) are zeros.

If the error is defined as \varepsilon_i = 1 - D X_i, then:

:\begin{align} \varepsilon_{i+1} &= 1 - D X_{i+1} \\ &= 1 - D (X_i(2-DX_i)) \\ &= 1 - 2DX_i + D^2X_i^2 \\ &= (1 - DX_i)^2 \\ &= {\varepsilon_i}^2. \end{align}

This squaring of the error at each iteration step, the so-called quadratic convergence of Newton–Raphson's method, has the effect that the number of correct digits in the result roughly doubles for every iteration, a property that becomes extremely valuable when the numbers involved have many digits (e.g. in the large-integer domain). But it also means that the initial convergence of the method can be comparatively slow, especially if the initial estimate X_0 is poorly chosen.
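As an illustration, the iteration and its quadratic convergence can be sketched in a few lines of Python. This is a sketch only: the divisor, dividend, and starting estimate below are arbitrary example values, and ordinary double-precision floats stand in for the working precision.

<syntaxhighlight lang="python">
# Newton-Raphson reciprocal iteration: X_{i+1} = X_i + X_i*(1 - D*X_i).
# Example values; D is assumed already scaled into [0.5, 1].
D = 0.75        # divisor
N = 0.6         # dividend
X = 1.0         # crude initial estimate of 1/D

for i in range(5):
    X = X + X * (1.0 - D * X)              # one iteration
    print(f"X_{i+1} = {X:.17f}, error = {abs(1.0 - D * X):.3e}")

print("quotient N/D =", N * X)             # Q = N * X_S; exact value is 0.8
</syntaxhighlight>

The printed error roughly squares at each step (0.0625, 0.0039, 1.5e-5, ...), which is the quadratic convergence described above.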
==== Initial estimate ====

For the subproblem of choosing an initial estimate X_0, it is convenient to apply a bit-shift to the divisor D to scale it so that 0.5 ≤ D ≤ 1; applying the same bit-shift to the numerator N ensures the quotient does not change. Once within this bounded range, a simple polynomial approximation can be used to find an initial estimate.

The linear approximation with minimum worst-case absolute error on the interval [0.5, 1] is:

:X_0 = {48 \over 17} - {32 \over 17} D.

The coefficients of the linear approximation T_0 + T_1 D are determined as follows. The absolute value of the error is |\varepsilon_0| = |1 - D(T_0 + T_1 D)|. The minimum of the maximum absolute value of the error is determined by the Chebyshev equioscillation theorem applied to F(D) = 1 - D(T_0 + T_1 D). The local minimum of F(D) occurs when F'(D) = 0, which has the solution D = -T_0/(2T_1). The function at that minimum must be of opposite sign to the function at the endpoints, namely F(1/2) = F(1) = -F(-T_0/(2T_1)). The two equations in the two unknowns have a unique solution T_0 = 48/17 and T_1 = -32/17, and the maximum error is F(1) = 1/17. Using this approximation, the absolute value of the error of the initial estimate is bounded by

:\vert \varepsilon_0 \vert \leq {1 \over 17} \approx 0.059.

The best quadratic fit to 1/D on the interval is

:X_0 = \frac{140}{33} - \frac{64}{11} D + \frac{256}{99} D^2.

It is chosen to make the error equal to a rescaled third-order Chebyshev polynomial of the first kind, and gives an absolute value of the error less than or equal to 1/99. This improvement is equivalent to \log_2(\log 99/\log 17) \approx 0.7 Newton–Raphson iterations, at a computational cost of less than one iteration. It is possible to generate a polynomial fit of degree larger than 2, computing the coefficients using the Remez algorithm. The trade-off is that the initial guess requires more computational cycles, hopefully in exchange for fewer iterations of Newton–Raphson.

Since for this method the convergence is exactly quadratic, it follows that, from an initial error \varepsilon_0, S iterations give an answer accurate to

:P = -2^S \log_2 \varepsilon_0 - 1 = 2^S \log_2(1/\varepsilon_0) - 1

binary places. Typically, a quadratic initial estimate plus two iterations is accurate enough for IEEE single precision, while three iterations are marginal for double precision; a linear initial estimate plus four iterations is sufficient for both the double and double-extended formats.
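The stated worst-case errors can be checked numerically. The following sketch samples |1 − D·X_0| for both fits over [0.5, 1]; the grid size is an arbitrary choice, so this is an empirical check rather than a proof.

<syntaxhighlight lang="python">
# Numerical check of the worst-case errors of the two initial estimates.
def linear(D):                      # minimax linear fit on [0.5, 1]
    return 48/17 - 32/17 * D

def quadratic(D):                   # quadratic fit with error <= 1/99
    return 140/33 - 64/11 * D + 256/99 * D**2

grid = [0.5 + 0.5 * k / 100000 for k in range(100001)]
for name, fit, bound in (("linear", linear, 1/17), ("quadratic", quadratic, 1/99)):
    worst = max(abs(1 - D * fit(D)) for D in grid)
    print(f"{name:9s} max error = {worst:.6f}  (bound = {bound:.6f})")
</syntaxhighlight>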
==== Pseudocode ====

The following computes the quotient of N and D with a precision of P binary places:

Express D as M × 2^e where 1 ≤ M < 2 (standard floating-point representation)
D' := D / 2^(e+1)   // scale between 0.5 and 1, can be performed with bit shift / exponent subtraction
N' := N / 2^(e+1)
X := 48/17 − 32/17 × D'   // precompute constants with same precision as D
repeat \left \lceil \log_2 \frac{P + 1}{\log_2 17} \right \rceil times   // can be precomputed based on fixed P
    X := X + X × (1 − D' × X)
end
return N' × X

For example, for a double-precision floating-point division, this method uses 10 multiplies, 9 adds, and 2 shifts.
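The pseudocode translates directly into Python. In the sketch below, ordinary floats stand in for true P-bit arithmetic, and math.frexp performs the scaling step (it returns a mantissa already in [0.5, 1), so the divisor needs no further shift).

<syntaxhighlight lang="python">
import math

def divide(N, D, P=53):
    """Sketch of Newton-Raphson division: N/D to roughly P binary places."""
    m, e = math.frexp(D)                 # D = m * 2**e with 0.5 <= m < 1
    Dp = m                               # scaled divisor, already in [0.5, 1)
    Np = math.ldexp(N, -e)               # same scaling: exponent subtraction
    X = 48/17 - 32/17 * Dp               # linear initial estimate
    S = math.ceil(math.log2((P + 1) / math.log2(17)))   # iteration count
    for _ in range(S):
        X = X + X * (1 - Dp * X)
    return Np * X

print(divide(355.0, 113.0))              # ~3.14159292, close to 355/113
</syntaxhighlight>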
==== Cubic iteration ====

There is an iteration which uses three multiplications to cube the error:

:\varepsilon_i = 1 - D X_i,
:Y_i = X_i \varepsilon_i,
:X_{i+1} = X_i + Y_i + Y_i \varepsilon_i.

The Y_i \varepsilon_i term is new. Expanding out the above, X_{i+1} can be written as

:\begin{align} X_{i+1} &= X_i + X_i\varepsilon_i + X_i\varepsilon_i^2 \\ &= X_i + X_i(1-DX_i) + X_i(1-DX_i)^2 \\ &= 3X_i - 3DX_i^2 + D^2X_i^3, \end{align}

with the result that the error term

:\begin{align} \varepsilon_{i+1} &= 1 - DX_{i+1} \\ &= 1 - 3DX_i + 3D^2X_i^2 - D^3X_i^3 \\ &= (1 - DX_i)^3 \\ &= \varepsilon_i^3. \end{align}

This requires 3/2 times the computation of the quadratic iteration per step, but achieves \log 3 / \log 2 \approx 1.585 times as much convergence, so it is slightly more efficient. Put another way, two iterations of this method raise the error to the ninth power at the same computational cost as three quadratic iterations, which only raise the error to the eighth power. The number of correct bits after S iterations is

:P = -3^S \log_2 \varepsilon_0 - 1 = 3^S \log_2(1/\varepsilon_0) - 1

binary places. Typically, a quadratic initial estimate plus two cubic iterations provides ample precision for an IEEE double-precision result.

It is also possible to use a mixture of quadratic and cubic iterations. Using at least one quadratic iteration ensures that the error is positive, i.e. the reciprocal is underestimated. This can simplify a subsequent rounding step if an exactly rounded quotient is required. Using higher-degree polynomials in either the initialization or the iteration results in a degradation of performance, because the extra multiplications required would be better spent on doing more iterations.
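A sketch of the cubic iteration in Python, using an arbitrary example divisor and the linear initial estimate from above; the printed error shrinks roughly as \varepsilon^3 per step.

<syntaxhighlight lang="python">
# Cubic iteration: three multiplications per step, error is cubed.
D = 0.6                        # example divisor, scaled into [0.5, 1]
X = 48/17 - 32/17 * D          # linear initial estimate, |eps_0| <= 1/17

for i in range(3):
    e = 1.0 - D * X            # eps_i    (first multiplication)
    Y = X * e                  # Y_i      (second multiplication)
    X = X + Y + Y * e          # X_{i+1}  (third multiplication)
    print(f"step {i + 1}: error = {abs(1.0 - D * X):.3e}")
</syntaxhighlight>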
=== Goldschmidt division ===

Goldschmidt division (after Robert Elliott Goldschmidt) uses an iterative process of repeatedly multiplying both the dividend and divisor by a common factor F_i, chosen such that the divisor converges to 1. This causes the dividend to converge to the sought quotient Q:

:Q = \frac{N}{D} \frac{F_1}{F_1} \frac{F_2}{F_2} \cdots.

The steps for Goldschmidt division are:
• Generate an estimate for the multiplication factor F_i.
• Multiply the dividend and divisor by F_i.
• If the divisor is sufficiently close to 1, return the dividend; otherwise, loop to step 1.

Assuming N/D has been scaled so that 0 < D < 1, each F_i is based on D_i:

:F_{i+1} = 2 - D_i.

Multiplying the dividend and divisor by the factor yields:

:\frac{N_{i+1}}{D_{i+1}} = \frac{N_i}{D_i}\frac{F_{i+1}}{F_{i+1}}.

After a sufficient number k of iterations, Q = N_k.

The Goldschmidt method is used in AMD Athlon CPUs and later models. It is also known as the Anderson Earle Goldschmidt Powers (AEGP) algorithm and is implemented by various IBM processors. Although it converges at the same rate as a Newton–Raphson implementation, one advantage of the Goldschmidt method is that the multiplications in the numerator and in the denominator can be done in parallel.
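A minimal Goldschmidt sketch in Python. The values are illustrative, D is assumed pre-scaled into (0, 1), and a real implementation would run a fixed, precomputed number of steps rather than testing for convergence.

<syntaxhighlight lang="python">
# Goldschmidt division: multiply N and D by F = 2 - D until D converges to 1.
N, D = 0.3, 0.75                   # example values with 0 < D < 1

while abs(1.0 - D) > 1e-15:        # hardware uses a fixed iteration count
    F = 2.0 - D                    # F_{i+1} = 2 - D_i
    N, D = N * F, D * F            # the two multiplies can run in parallel
print("quotient:", N)              # 0.3 / 0.75 = 0.4
</syntaxhighlight>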
==== Binomial theorem ====

The Goldschmidt method can be used with factors that allow simplifications by the binomial theorem. Assume N/D has been scaled by a power of two such that D\in\left(\tfrac{1}{2},1\right]. We write D = 1-x and choose F_{i} = 1+x^{2^i}. This yields

:\frac{N}{1-x} = \frac{N\cdot(1+x)}{1-x^2} = \frac{N\cdot(1+x)\cdot(1+x^2)}{1-x^4} = \cdots = Q' = \frac{N'}{D'},

where N' = N\cdot(1+x)\cdot(1+x^2)\cdots\left(1+x^{2^{n-1}}\right) and D' = 1-x^{2^n} \approx 1.

After n steps \left( x\in\left[0,\tfrac{1}{2}\right) \right), the denominator 1-x^{2^n} can be rounded to 1 with a relative error

:\varepsilon_n = \frac{Q' - N'}{Q'} = x^{2^n},

which approaches its maximum value 2^{-2^n} as x approaches \tfrac{1}{2}, thus providing a minimum precision of 2^n binary digits.
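In this form no estimation step is needed for the factors, since each new factor follows from the previous one by squaring x, as the following Python sketch illustrates (example values).

<syntaxhighlight lang="python">
# Binomial-theorem variant: with D = 1 - x, the factors are 1 + x^(2^i).
N, D = 0.5, 0.75               # example values with D in (1/2, 1]
x = 1.0 - D                    # x in [0, 1/2)

for _ in range(5):             # after n steps the denominator is 1 - x^(2^n)
    N *= 1.0 + x               # multiply the numerator by F_i = 1 + x^(2^i)
    x *= x                     # next power x^(2^(i+1)) by squaring
print("quotient:", N)          # 0.5 / 0.75 = 0.666...
</syntaxhighlight>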
== Large-integer methods ==