P5 Optimization

Pentium Optimizations

The most notable content on earlier versions of this page has gone out of date and has now been superceded by the original author, Agner Fog. I am no longer motivated to keep it up to date, so the old content has been removed. I would recommend that you visit Agner's own page on How to optimize for the Pentium® microprocessors for the most up to date information on Pentium Optimization.

Examples

Example 1

[Old work removed] While the old loops that were here not bad, they were woefully inadequate in comparision to what is possible. I recently had reason to revisit the mandelbrot inner loop problem and have found a couple tremendous speed ups. Recall that the critical inner loop looks like:

a = c_re; b = c_im; for( i=0; i<LIM; i++ ) { x = a*b; a *= a; b *= b; if( a + b > 4.0 ) break; a = a - b + c_re; b = x + x + c_im; } plot( c_im, c_re, i );

The first speed up is due to Damien Jones:

1) Rather than checking the exit condition for the (a,b) point on every iteration, why not unroll loop the several times and only check for escape after a few iterations?

a = c_re; b = c_im; for( i=3; i<LIM; i+=4 ) { oa = a; ob = b; x = 2*a*b + c_im; a = a*a - b*b + c_re; b = 2*a*x + c_im; a = a*a - x*x + c_re; x = 2*a*b + c_im; a = a*a - b*b + c_re; b = a*x; x *= x; a *= a; if( a + x > 4.0 ) break; a = a - x + c_re; b = b + b + c_em; } a = oa; b = ob; i -= 3; for( ; i<LIM; i++ ) { x = a*b; a *= a; b *= b; if( a + b > 4.0 ) break; a = a - b + c_re; b = x + x + c_im; } plot( c_im, c_re, i );

2) Since the FPU operations are dependent on each other, which serializes the computation for the most part we need to find some way of increasing parallelism. The only thing we really have available to us are the many points that consititute the mandelbrot set. So to speed it up we much compute serval points at once. Just to give you a taste of the inner loop:

for( i=0; i<LeastCountLeft; i++ ) { x[0] = a[0]*b[0]; x[1] = a[1]*b[1]; x[2] = a[2]*b[2]; x[3] = a[3]*b[3]; a[0] *= a[0]; a[1] *= a[1]; a[2] *= a[2]; a[3] *= a[3]; b[0] *= b[0]; b[1] *= b[1]; b[2] *= b[2]; b[3] *= b[3]; if( a[0] + b[0] > 4.0 ) break; if( a[1] + b[1] > 4.0 ) break; if( a[2] + b[2] > 4.0 ) break; if( a[3] + b[3] > 4.0 ) break; a[0] = a[0] - b[0] + c_re[0]; a[1] = a[1] - b[1] + c_re[1]; a[2] = a[2] - b[2] + c_re[2]; a[3] = a[3] - b[3] + c_re[3]; b[0] = x[0] + x[0] + c_im[0]; b[1] = x[1] + x[1] + c_im[1]; b[2] = x[2] + x[2] + c_im[2]; b[3] = x[3] + x[3] + c_im[3]; }

Looking at the exit conditions it should be obvious that the various points, are initialized at different times with respect to each other and each time through the above "engine" that somewhere between 1 and 4 points will be completed each time through. So the the overhead is quite complicated. But it is will worth it for the performance gained through the high degree of parallelism now opened up.

The final coup de graspe is seen by observing that the above inner is easily converted to a SIMD instruction set. So MMX (fixed point), 3DNow! or SSE can be used. (32 bit FP can be used for resolutions/zooms up to about 65kx65K.)

The most ideal compromise between the two ideas above that fits in eight 2-way 3DNow! registers is to unroll once, and SIMD to 4 points. Doing so uses every register as well as memory operands for the c_??[*] variables. The final loop is not shown here because of its sheer complexity (so it is left as an exercize to the reader).

When converting to assembly language, remember that you can use integer compares instead of the long latency floating point comparisons. Highly out of order processors like the P-II and Athlon also have more important considerations than exact dependency scheduling. For those processors be sure that as many of your instructions are aligned so as not to straddle 16 byte boundaries as possible.

Side note on Intel's sample MMX implementation

I would just like to take this moment to point something out. On Intel's MMX developer example web pages, they have posted some code to demonstrate the advantages of using MMX code for precisely this application. I was flabergasted by what I saw there. I don't believe they do a fair comparison. They have written code that is so bad as to be basically unusable for the purposes of any remotely practical Mandelbrot generation. I can't understand why they have done this.

First of all, as is demonstrated by my program, low zooms of Mandelbrot sets are completely uninteresting as they can be computed in less than a couple seconds on even the slowest of Pentium based computers. However, because they choose only 16 bits of precision for their calculations (so that they can do many of them at once using MMX style SIMD) they will not be able to generalize their algorithm for deep zooms (they would lose accuracy very quickly.) The sample image, itself practically betrays the fact that the output quality of their algorithm is quite poor.

Second the comparative FPU based code is a complete sham. There is essentially no pipelining or attempt at ordinary Pentium optimization at all. The code above, (even my code) on a gut feel should execute between 2 and 3 times as fast as the code given in Intel's example. Note that they use FSTSW, they have unnecessary resource contentions on FPU registers, and use FST in inappropriate ways while not once using FXCH to maximally leverage pipelining. Given the totally unpipelined way it has been written, it looks like the AMD K6 could beat the Pentium or Pentium-II on that code by as much as 50%!

I can only imagine this was shown in this way because it is probably not much worse than your typical C compiler so Intel may have figured that programmers would not notice. Problem is, most programmers that can read the code and actually understand the issues well enough to understand how the code works are more likely to know a thing or too about optimization, and therefore know that the code shown is basically junk.

Just to make myself clear I claim that all three routines given by Intel are garbage. The MMX one can't be used for zooms, the FPU one is poorly pipelined and using integer code is just a bad way to try to do it right from the start. In my opinion, Intel scores 3 goose eggs.

Example 2

There was a USENET posting, where the author was trying to gage the performance of the FPU's multiplication capabilities. After pointing out an obvious flaw in his first attempt (his accumulated multiplies were causing overflow exceptions which significantly slowed the FPU performance), the following C code was arrived at:

#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

void DoFpMult(void) {
    int i;
    float val2 = (float)1.00007;
    float val1 = (float)1.2;
    int startTime;

    startTime = timeGetTime();

    for (i = 0; i < 1000000; i++) {
        val1 *= val2;
    }

    printf("Took %d clocks result:%f\n",timeGetTime()-startTime,val1);
}

Of course it can be optimized with a simple exp/log calculation, but as I explained above, the point of the exercise was primarily to estimate the performance of the FPU's multiplier.

According to VTUNE and the actual results the floating point version was faster than the similarly written integer (fixed point) code by only about 35%. This result appeared to hold true for both the WATCOM C/C++ compiler and the Visual C++ compiler. After disassembling the code, it was fairly clear why the C compilers were not doing better. Here is the inner loop WATCOM C/C++ gave:

L1: fld     st(1)
    inc     eax
    fmul    st,st(1)
    fstp    st(2)
    cmp     eax,1000000
    jl      L1

Although the compiler does seem to realize that it should shuffle integer instructions that dont pair in between FPU instructions (to take advantage of shadowing any incidental stalls that might occur between them,) the problem, of course, is that the FPU stack is being shuffled around unnecessarily. Clearly the ideal inner loop looks like:

L1: fmul    st,st(1)
    dec     ecx
    jne     L1

Unrolling doesn't help since the throughput of the fmul is 3 clocks anyways. This code can be fused back into the original C code using the WATCOM C/C++ compiler as follows:

#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

float FPMUL(float v1, float v2, int count);
#pragma aux FPMUL =   \
" L1: fmul st,st(1) " \
"     dec ecx       " \
"     jnz L1        " \
"     fxch          " \
"     fstp st(0)    " \
value [8087] parm [8087] [8087] [ecx];

void DoFpMult(void) {
    float   val2 = (float)1.00007;
    float   val1 = (float)1.2;
    int     startTime;

    startTime = timeGetTime();

    val1 = FPMUL(val1,val2,1000000);

    printf("Took %d clocks result:%f\n",timeGetTime()-startTime,val1);
}

The mechanism for VC++ works differently, requiring you to handle all the passing from float variables to the FPU stack yourself, however it should be possible to have at least the same inner loop. This loop actually doubles the performance making it run 3 times faster than comparable integer code. While not a Pentium specific optimization, I believe it is important to understand how to use the floating point unit, to use the Pentium most effectively, far more so than previous generation x86s. The lesson here is that you should not blindly trust your C compiler even if they advertise that they have Pentium Optimizations.

Other References

A comparison of 6th generation x86 architectures
Assembly programming examples
The Post-RISC Era an explanation of modern CPU architecture.
Ryan Giess' Pentium programming FAQs
A explanation of memory caching in modern CPUs
Calculate the clock rate of your Pentium
Optimized Block memory copying
More general x86 optimizations
Arithmetic: Multiplication, Division and Square Root
Vzzrzzn's Optimizations Page
Chris Ludloff's Sandpile site
Robert Collin's Intel Secrets site
Background article from BYTE magazine
Intel's VTune (a Windows performance optimization tool)
Mike Schmit's Pentium Optimization tools
An excerpt and book offer from The Zen of Code Optimization
Inner Loops by Rick Booth (Addison Wesley) in Computer related bookstores now.