 Integer multiplying by constants
© Copyright 1996-2004 by Paul Hsieh
As discussed in various Pentium optimization articles and books, multiplies by fixed constants should, in many cases be handled by shifts and adds. Intel recommends that any fixed integer with fewer than 8 "1" bits set should be decomposed to shifts and adds. The implication is that you should do the obvious "horner's rule" method of just shifting and adding. For example, 181 = 1+(1+(1+(1+4)*2)*4)*4, so x*181 might be encoded as x+(x+(x+(x+(x<<2))<<1)<<2)<<2 which has an obvious assembly language translation into the one cycle shl and add instructions. In fact it is not hard at to come up with an algorithm that gives this decomposition, and even compiles up the assembly instructions themselves in an automated way.

However, the x86 ISA also includes lea, sub and mov (for shuffling partial results amongst other registers.) Instructions like sub and mov are self explanatory, but lea is a very curious and powerful instruction that gives some multiregister linear combinations as well as some additional odd multiplies. The x86's "lea" instruction allows you to compute:

r0 := r1*{0,1} + r2*{0,1,2,4,8}

for any 3 general purpose 32 bit registers, in a single clock. "sub" (subtract) and "mov" (copy value to/from register) are self explanatory. So using these instructions, we see that 181 = 4*(5*9)+1 which can be encoded as: b = x*5; b = b*9; return x + b*4;

Below I give a list of fixed multiplies that includes shl, add, lea, sub and mov instructions. The results were arrived at automatically using an iterated depth-first search. In a previous attempt I wrote a program to do this with a technique called Dynamic Programming which uses an open list of nodes, that simply pushed on the frontiers of the cycle count in an ordered way. However this apporach was not very memory efficient which turns out to be the limiting factor for this type of problem.

My previous approach was also limited in modelling the cycle expendature of the instruction sequence. It simply looked at the current instruction and added its "cost" to the instruction stream. Since all these instructions on the Pentium take one clock with all things being equal all I did was measure the cost as the length of the instruction. So it could find sequences of minimal length, but not maximal performance.

My new approach very literally models the Pentium's register contention and address generation interlocking stalls as best as I know them. For reference I used Agner Fog's Pentium Optimization documentation. I also discussed this with Terje Mathisen who is a well known optimization expert. There is source code accompanying the table for pentium optimized constant multiplies given below.

The advantage of this more general approach over the one recommended by Intel, of course, is that instead of being limited to binary polynomials in a given decomposition (namely "horner's decomposition"), factorings and some limited diophantine (linear) expressions are used.

In fact, these code sequences are "provably optimal", with the caveats that only these instruction (shl, mov, sub, add, lea) are assumed to be useful and that use of no more than one other register (ebx) is useful. The first assumption I feel to be acceptable (I challenge anyone to prove me wrong), but the second is likely to have an non-trivial impact for longer sequences. From a practical point of view, however, this is not such a big concern since the mul instruction itself put an upper bound of 9 clocks on the whole problem anyways.     