Integer division and modulus by constants.
by Paul Hsieh
Division is a very slow, complex procedure when implemented for
general divisors. However, if the divisor is a constant we can
simply use fixed point approximation methods to multiply by the
a/b == (int)(((2^32)+b-1)/b) * (a) / (2^32)
Clearly, (int)(((2^32)+b-1)/b) is precalculated.
Similarly, modulus for the number k*2^t (where k is odd) is
a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k
For example, to calculate a (mod 320) the code would look like:
mov edx,033333334h ; (int)(((2^32)+k-1)/k)
and edx,0FFFFFFC0h ; ( ? / (2^32)) & (-2^t)
lea edx,[edx+edx*4] ; ? * k
sub ebx,edx ; a - ?
The approximation is very good for reasonably small "b" or "k*2^t".
With some work, the constant multiplies can also be gotten rid of
(which in many cases is a good idea since mul is not a very fast
instruction for the x86 based PC.) See: constant
multiplies for more information.
Using Newton's Method for divide approximation
On the USENET, Richard Cant writes:
He does not specify the initial approximation which, it turns out, is
crucial for proper convergence. Left as an exercise to the reader.
This technique is based on the use of Newtons method to solve the
(i.e. f(y) = 0)
given a first guess at the solution
A better approximation is given by
This process can be repeated indefinitely - giving an answer with
approximately twice as many significant bits each time.
Given a fast multiplier and a moderately accurate lookup table to
start with it is possible to produce a 24 bit result in about 3-4
machine cycles (1 iteration) and a 48 bit result in about 6 machine
cycles (2 iterations). The method is thus a good compromise
between accuracy latency and throughput - best suited to achieving
minimum latency with accuracies greater than can be achieved
using lookup table methods.
It's often the best method if latency is the issue - not if the divide
can be done in a separate unit and you don't mind the delay.
(NB the "cycles" referred to above are for an idealised machine - not
necessarily any real processor such as the Pentium)
Using Power Series (for floating point numbers)
Those familliar with power series and generating function will recall that
for a number:
we know that:
x = 2^e * (1 + m), where 0 <= m < 1
which can be re-expressed as:
1/x = 2^(-e) * (1 - m + m^2 - m^3 + m^4 -+ ...), where 0 <= m < 1
Of course, for the best convergence, chosing -0.5 <= m < 0.5 would be
ideal. So if m >= 0.5 then:
So, without loss of generality we can assume -0.5 <= m < 0.5 and compute
an approximation for 1/x as:
1/x = 2^(-e) * (1 - m) * (1 + m^2) * (1 + m^4) * (1 + m^8) * (1 + m^16) * (1 + m^32) * ... , where 0 <= m < 1
y0 = 1 - m; m *= m;
y1 = 1 + m; m *= m;
y2 = 1 + m; m *= m; rx = y0*y1;
y3 = 1 + m; m *= m; rx *= y2;
y4 = 1 + m; rx *= y3;
rx *= y4;
But if rx < 1 (i.e., m > 0)
1/x = 2^(-e) * (1 + (rx - 1))
Which leads us to the final desired result of:
M = 2*(1 - rx)|
E = -e - 1
And if you just can't get enough of computational division algorithms, here
are some papers about how they are done in modern CPUs today.
I wonder which method is being used in the graphics world these days ...