Integer division and modulus by constants.

by Paul Hsieh

Division is a very slow, complex procedure when implemented for general divisors. However, if the divisor is a constant, we can simply use fixed-point approximation methods to multiply by the reciprocal:

    a/b == (int)(((2^32)+b-1)/b) * (a) / (2^32)    
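For instance, here is the formula rendered as C for the constant divisor b = 5 (a minimal sketch, assuming 32-bit unsigned operands; the helper name div5 is mine, for illustration only):

    #include <stdint.h>

    /* Divide by 5 via the precomputed reciprocal
       (int)(((2^32)+5-1)/5) = 0x33333334; keeping the high 32 bits
       of the 64-bit product is the trailing "/ (2^32)" step. */
    static uint32_t div5(uint32_t a)
    {
        return (uint32_t)((0x33333334ULL * a) >> 32);
    }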

Clearly, (int)(((2^32)+b-1)/b) is precalculated. Similarly, modulus for the number k*2^t (where k is odd) is calculated as:

    a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k    

For example, to calculate a (mod 320) the code would look like:


    mov   edx,033333334h        ; (int)(((2^32)+k-1)/k), k = 5
    mov   ebx,eax               ; save a
    mul   edx                   ; edx:eax = a * reciprocal
    and   edx,0FFFFFFC0h        ; ( ? / (2^32)) & (-2^t), t = 6
    lea   edx,[edx+edx*4]       ; ? * k
    sub   ebx,edx               ; ebx = a - ? = a mod 320
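
The same computation in C (a sketch, again assuming 32-bit unsigned operands; the name mod320 is mine):

    #include <stdint.h>

    /* a mod 320, where 320 = k*2^t with k = 5, t = 6. */
    static uint32_t mod320(uint32_t a)
    {
        uint32_t q = (uint32_t)((0x33333334ULL * a) >> 32);  /* ~ a/5 */
        q &= 0xFFFFFFC0;          /* (a/5) & (-2^t) = (a/320) * 2^t */
        return a - q * 5;         /* a - (a/320)*320 */
    }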

The approximation is very good for reasonably small "b" or "k*2^t". With some work, the constant multiplies can also be eliminated (which in many cases is a good idea, since mul is not a very fast instruction on x86-based PCs). See: constant multiplies for more information.
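
To give a flavor of that transformation, a multiply by 320 can be decomposed into shifts and an add, since 320 = 256 + 64 (a hypothetical illustration, not taken from that page):

    /* a * 320 without a mul instruction. */
    static unsigned mul320(unsigned a)
    {
        return (a << 8) + (a << 6);
    }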


Using Newton's Method for divide approximation

On the USENET, Richard Cant writes:
This technique is based on the use of Newton's method to solve the equation:
x - 1/y = 0
(i.e. f(y) = 0) given a first guess at the solution
y = y1
A better approximation is given by
y2 = y1 - f(y1)/f'(y1)
i.e.
y2 = y1*(2-x*y1)
This process can be repeated indefinitely - giving an answer with approximately twice as many significant bits each time.

Given a fast multiplier and a moderately accurate lookup table to start with, it is possible to produce a 24-bit result in about 3-4 machine cycles (1 iteration) and a 48-bit result in about 6 machine cycles (2 iterations). The method is thus a good compromise between accuracy, latency, and throughput - best suited to achieving minimum latency with accuracies greater than can be achieved using lookup table methods.

It's often the best method if latency is the issue - not if the divide can be done in a separate unit and you don't mind the delay.

(NB the "cycles" referred to above are for an idealised machine - not necessarily any real processor such as the Pentium)

--
Richard Cant

He does not specify the initial approximation, which, it turns out, is crucial for proper convergence. Left as an exercise to the reader.
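
That said, one workable seed is the classic linear fit y1 = 48/17 - (32/17)*x for x scaled into [1, 2) (my assumption here, not Cant's table). A C sketch:

    /* Newton-Raphson reciprocal, y <- y*(2 - x*y).
       Assumes the caller has scaled x into [1, 2); the seed's
       worst-case relative error (~1/17, about 4 bits) then
       roughly doubles in accuracy with each iteration. */
    static double recip_newton(double x)
    {
        double y = 48.0/17.0 - (32.0/17.0) * x;  /* initial guess */
        y = y * (2.0 - x * y);    /* ~ 8 good bits  */
        y = y * (2.0 - x * y);    /* ~ 16 good bits */
        y = y * (2.0 - x * y);    /* ~ 33 good bits */
        y = y * (2.0 - x * y);    /* full double precision */
        return y;
    }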

Using Power Series (for floating point numbers)

Those familiar with power series and generating functions will recall that for a number:
x = 2^e * (1 + m), where 0 <= m < 1
we know that:
1/x = 2^(-e) * (1 - m + m^2 - m^3 + m^4 -+ ...), where 0 <= m < 1
which (since 1/(1+m) = (1-m)/(1-m^2) = (1-m)(1+m^2)/(1-m^4) = ...) can be re-expressed as:
1/x = 2^(-e) * (1 - m) * (1 + m^2) * (1 + m^4) * (1 + m^8) * (1 + m^16) * (1 + m^32) * ... , where 0 <= m < 1
Of course, for the best convergence, choosing -0.5 <= m < 0.5 would be ideal. So if m >= 0.5 then take:
e' = e+1
m' = (m-1)/2
so that 2^e' * (1 + m') = 2^e * (1 + m), with -0.25 <= m' < 0.
So, without loss of generality we can assume -0.5 <= m < 0.5 and compute an approximation for 1/x as:
y0 = 1 - m; m *= m;
y1 = 1 + m; m *= m;
y2 = 1 + m; m *= m; rx  = y0*y1;
y3 = 1 + m; m *= m; rx *= y2;
y4 = 1 + m;         rx *= y3;
                    rx *= y4;
So,
1/x = 2^(-e) * (1 + (rx - 1))
But if rx < 1 (i.e., m > 0):
M = 2*rx - 1
E = -e - 1
Otherwise,
M = (rx - 1)
E = -e
Which leads us to the final desired result of:
1/x = 2^E * (1 + M)
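
Putting the pieces together in C (a sketch, assuming IEEE doubles and x > 0; frexp/ldexp handle the 2^e unpacking and the final 2^E repacking):

    #include <math.h>

    /* Reciprocal via the truncated product
       (1-m)(1+m^2)(1+m^4)(1+m^8)(1+m^16); the relative error is
       about m^32, i.e. roughly 32 bits for |m| <= 0.5. */
    static double recip_series(double x)
    {
        int e;
        double m = 2.0 * frexp(x, &e) - 1.0;  /* x = 2^(e-1)*(1+m), 0 <= m < 1 */
        e -= 1;
        if (m >= 0.5) {           /* renormalize: e' = e+1, m' = (m-1)/2 */
            m = (m - 1.0) * 0.5;
            e += 1;
        }
        double y0 = 1.0 - m;  m *= m;
        double y1 = 1.0 + m;  m *= m;
        double y2 = 1.0 + m;  m *= m;
        double y3 = 1.0 + m;  m *= m;
        double y4 = 1.0 + m;
        return ldexp(y0 * y1 * y2 * y3 * y4, -e);  /* 2^(-e) * rx */
    }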

And if you just can't get enough of computational division algorithms, here are some papers about how they are done in modern CPUs today.

I wonder which method is being used in the graphics world these days ...

