# Integer division and modulus by constants.

by Paul Hsieh

Division is a very slow, complex procedure when implemented for general divisors. However, if the divisor is a constant we can simply use fixed point approximation methods to multiply by the reciprocal:

 ``` a/b == (int)(((2^32)+b-1)/b) * (a) / (2^32) ```

Clearly, (int)(((2^32)+b-1)/b) is precalculated. Similarly, modulus for the number k*2^t (where k is odd) is calculated as:

 ` a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k `

For example, to calculate a (mod 320) the code would look like:

 ``` mov edx,033333334h ; (int)(((2^32)+k-1)/k) mov ebx,eax mul edx and edx,0FFFFFFC0h ; ( ? / (2^32)) & (-2^t) lea edx,[edx+edx*4] ; ? * k sub ebx,edx ; a - ?```

The approximation is very good for reasonably small "b" or "k*2^t". With some work, the constant multiplies can also be gotten rid of (which in many cases is a good idea since mul is not a very fast instruction for the x86 based PC.) See: constant multiplies for more information.

# Using Newton's Method for divide approximation

On the USENET, Richard Cant writes:
This technique is based on the use of Newtons method to solve the equation:
 x-1/y = 0
(i.e. f(y) = 0) given a first guess at the solution
 y = y1
A better approximation is given by
 y2 = y1-f(y1)/f'(y1)
i.e.
 y2 = y1*(2-x*y1)
This process can be repeated indefinitely - giving an answer with approximately twice as many significant bits each time.

Given a fast multiplier and a moderately accurate lookup table to start with it is possible to produce a 24 bit result in about 3-4 machine cycles (1 iteration) and a 48 bit result in about 6 machine cycles (2 iterations). The method is thus a good compromise between accuracy latency and throughput - best suited to achieving minimum latency with accuracies greater than can be achieved using lookup table methods.

It's often the best method if latency is the issue - not if the divide can be done in a separate unit and you don't mind the delay.

(NB the "cycles" referred to above are for an idealised machine - not necessarily any real processor such as the Pentium)

--
Richard Cant

He does not specify the initial approximation which, it turns out, is crucial for proper convergence. Left as an exercise to the reader.

# Using Power Series (for floating point numbers)

Those familliar with power series and generating function will recall that for a number:
 x = 2^e * (1 + m), where 0 <= m < 1
we know that:
 1/x = 2^(-e) * (1 - m + m^2 - m^3 + m^4 -+ ...), where 0 <= m < 1
which can be re-expressed as:
 1/x = 2^(-e) * (1 - m) * (1 + m^2) * (1 + m^4) * (1 + m^8) * (1 + m^16) * (1 + m^32) * ... , where 0 <= m < 1
Of course, for the best convergence, chosing -0.5 <= m < 0.5 would be ideal. So if m >= 0.5 then:
 e' = e+1 m' = -m/2
So, without loss of generality we can assume -0.5 <= m < 0.5 and compute an approximation for 1/x as:
 ```y0 = 1 - m; m *= m; y1 = 1 + m; m *= m; y2 = 1 + m; m *= m; rx = y0*y1; y3 = 1 + m; m *= m; rx *= y2; y4 = 1 + m; rx *= y3; rx *= y4;```
So,
 1/x = 2^(-e) * (1 + (rx - 1))
But if rx < 1 (i.e., m > 0)
 M = 2*(1 - rx) E = -e - 1
Otherwise,
 M = (rx - 1) E = -e
Which leads us to the final desired result of:
 1/x = 2^E * (1 + M)

And if you just can't get enough of computational division algorithms, here are some papers about how they are done in modern CPUs today.

I wonder which method is being used in the graphics world these days ...     