
However, the x86 ISA also includes lea, sub and mov (for
shuffling partial results amongst other registers.) Instructions like sub and
mov are self explanatory, but lea is a very curious and powerful instruction
that gives some multiregister linear combinations as well as some additional
odd multiplies. The x86's "lea" instruction allows you to compute:
Below I give a list of fixed multiplies that includes shl, add, lea, sub and mov instructions. The results were arrived at automatically using an iterated depthfirst search. In a previous attempt I wrote a program to do this with a technique called Dynamic Programming which uses an open list of nodes, that simply pushed on the frontiers of the cycle count in an ordered way. However this apporach was not very memory efficient which turns out to be the limiting factor for this type of problem.
My previous approach was also limited in modelling the cycle expendature of the instruction sequence. It simply looked at the current instruction and added its "cost" to the instruction stream. Since all these instructions on the Pentium take one clock with all things being equal all I did was measure the cost as the length of the instruction. So it could find sequences of minimal length, but not maximal performance.
My new approach very literally models the Pentium's register contention and address generation interlocking stalls as best as I know them. For reference I used Agner Fog's Pentium Optimization documentation. I also discussed this with Terje Mathisen who is a well known optimization expert. There is source code accompanying the table for pentium optimized constant multiplies given below.
The advantage of this more general approach over the one recommended by Intel, of course, is that instead of being limited to binary polynomials in a given decomposition (namely "horner's decomposition"), factorings and some limited diophantine (linear) expressions are used.
In fact, these code sequences are "provably optimal", with the caveats that only these instruction (shl, mov, sub, add, lea) are assumed to be useful and that use of no more than one other register (ebx) is useful. The first assumption I feel to be acceptable (I challenge anyone to prove me wrong), but the second is likely to have an nontrivial impact for longer sequences. From a practical point of view, however, this is not such a big concern since the mul instruction itself put an upper bound of 9 clocks on the whole problem anyways.
Old, size optimized table  New, pentium performance optimized table 