|Introduction||Code Architecture||Low Level Strategies||Common Misconceptions||A Few Good Tricks||Links||High Level vs. Low Level|
This is a page about the elusive subject of program performance optimization. Before you proceed towards such lofty goals, you should examine your reasons for doing so. Optimization is but one of many desirable goals in software engineering, and is often antagonistic to other important goals such as stability, maintainability, and portability. At its most cursory level (efficient implementation, clean non-redundant interfaces) optimization is beneficial and should always be applied. But at its most intrusive (inline assembly, pre-compiled/self-modified code, loop unrolling, bit-fielding, superscalar and vectorizing) it can be an unending source of time consuming implementation and bug hunting. Be cautious and wary of the cost of optimizing your code.
You will probably notice a large slant towards Intel x86 based optimization techniques, which shouldn't surprise many, since that is where my background is strongest. On the other hand I have used various other architectures, run profilers and debuggers on a variety of non-x86 UNIX boxes; I have tried to be as general as possible where I can. However, many of the recommendations and techniques may simply not work for your processor or environment. In that event, I should emphasize that first hand knowledge is always better than following advice. I would also appreciate any feedback you might have regarding other platforms or other optimization techniques you have experience with.
I have written up this page for a few reasons: (1) I have seen almost nothing of reasonable quality or depth elsewhere (2) I hope to get feedback from others who may know a thing or two, that I don't (3) To enrich the quality of the internet (4) Expensive books on this subject have been getting a bit too much attention (5) To get more hits on my web page :o)
"None of us learn in a vacuum; we all stand on the shoulders of giants such as Wirth and Knuth and thousands of others. Lend your shoulders to building the future!"
|The General Task of Code Optimization|
It should be pointed out, that over time the idea of code optimization has evolved to include "execution profiling" (i.e., direct measurement of "hotspots" in the code from a test run) as its guiding strategy. To understand this and put it into perspective, let us first break down the performance of a program down as follows:
What is typically seen is that Total time is usually dominated by one or very few of the time of taski terms because either those tasks are where there is the most work to do or where their algorithms are complicated enough for them to have a low rate of execution. This effect tends to be recursive in the sense of breaking tasks down to sub-tasks, which has lead to the creation and adoption of line-by-line based execution profilers.
What one should also notice is that if the strategy for improving performance of code is to improve the efficiency of the code for one particular taski we see that its possible to reach a point of diminishing returns, because each term cannot go below 0 in its overall contribution to the time consumption. So an optimization exercise of this sort for any particular task need not proceed past the point where the total time taken for any taski is significantly below the average. This usually means that effort is better spent in switching to another taski to optimize.
This, of course, is fine and a credible way of proceeding, however, by itself leaves out algorithm analysis, as well as the possibility of trying different task breakdowns. The classic example here is that tweaking bubble sort is not going to be as useful as switching to a better algorithm, such as quicksort or heapsort if you are sorting a lot of data.
One also has to be careful about what actually is being measured. This is epitomized by the story that if the Unix kernel typically spends most of its time in the idle loop, that doesn't mean there will be any useful advantage in optimizing it.
In general design changes tend to affect performance more than "code tweaking". So clearly one should not skip straight to assembly language until higher level approaches have been exhausted.
|Low Level Strategies|
When all algorithmic optimization approaches have been tried and you are still an order of magnitude away from your target performance you may find that you have no choice but to analyze the situation at the lowest possible level. 3 useful resources at this point are (1) Paul Hsieh's x86 Assembly language Page (2) The Zen of Code Optimization (by Michael Abrash) and (3) an html page entitled Pentium Optimizations by Agner Fog
The following is a list of things programmers often take as either self-evident or have picked up from others and mistakenly believe. In disputing these misbeliefs I only intend to improve awareness about performance optimization. But in doing so, I only offer half the solution. To be really convinced the aspiring performance conscious programmer should write up examples and disassemble them (in the case of using a HLL like C) or simply profile or time them to see for themselves.
|A Few Good Tricks|
|Although I use C syntax in the examples
below, these techniques
clearly apply to other languages just as well.
"Else" clause removal
The performance of if-then-else is one taken jump no matter what. However, often the condition has a lop-sided probability in which case other approaches should be considered. The elimination of branching is an important concern with today's deeply pipelined processor architectures. The reason is that a "mispredicted" branch often costs many cycles.
Clearly this only works if Undo Case B; is possible. However, if it is, this technique has the advantage that the jump taken case can be optimized according to the Condition probability and Undo Case B; Case A; might be merged together to be more optimal than executing each separately.
Obviously you would swap cases A and B depending on which way the probability goes. Also since this optimization is dependent on sacrificing performance of one set of circumstances for another, you will need to time it to see if it is really worth it. (On processors such as the ARM or Pentium II, you can also use conditional mov instructions to achieve a similar result.)
Use finite differences to avoid multiplies
This one should be fairly obvious, use constant increments instead of multiplies if this is possible. (Believe it or not, however, some C compilers are clever enough to figure this out for you in some simple cases.)
Use powers of two for multidimensional arrays
The advantage of using powers of two for all but the leftmost array size is when accessing the array. Ordinarily the compiled code would have to compute a multiply to get the address of an indexed element from a multidimensional array, but most compilers will replace a constant multiply with a shift if it can. Shifts are ordinarily much faster than multiplies.
Optimizing loop overhead
Many compilers can't optimize certain loops into the form most suitable for them. Ordinarily most CPUs have a conditional branch mechanism that works well when counting down from a positive number to zero or negative one. The conditional clause for any for loop must be evaluated on the first pass, as with any other time through the loop, but often is a trivial TRUE:
Many compilers will see this optimization as well.
Data type considerations
Often to conserve on space you will be tempted to mix integer data types; chars for small counters, shorts for slightly larger counters and only use longs or ints when you really have to. While this may seem to make sense from a space utilization point of view, most CPUs have to end up wasting precious cycles to convert from one data type to another, especially when preserving sign.
Declare local functions as "static"
Doing so tells the compiler that the function need not be so general as to service arbitrary general calls from unrelated modules. If the function is small enough, it may be inlined without having to maintain an external copy. If the function's address is never taken, the compiler can try to simplify and rearrange usage of it within other functions.
Observe the large latency of modern CPUs
The language definition for ANSI C, does not allow it take advantage of abelian math operators for better scheduling. This will become increasingly important as CPUs in the future move towards having ALUs with larger latencies (takes longer to perform operation -- this allows for high frequency for the CPU.)
Some compilers allow for "trade accuracy for speed with respect to floating point" switches. In theory this might allow them to see this sort of translation, however I have never seen any such thing.
Also, in theory, a SIMD enabled compiler could direct the above
to a SIMD
instruction set (such as 3DNow!, SSE or AltiVec.) I definitely
have not seen
|High Level vs. Low Level|
In the old days, it was pretty easy to understand that writing your programs in assembly would tend to yield higher performing results than writing in higher level languages. Compilers had solved the problem of "optimization" in too general and simplistic a way, and had no hope of competing with the sly human assembly coder.
These days the story is slightly different. Compilers have gotten better and the CPUs have gotten harder to optimize for. Inside some research organizations the general consensus is that compilers could do at least as well as humans in almost all cases. During a presentation I gave to some folks at AT&T Bell labs (a former employer) I explained that I was going to implement a certain piece of software in assembly language, which raised eyebrows. One person went so far as to stop me and suggest a good C/C++ compiler that would do a very good job of generating optimized object code and make my life a lot easier.
But have compilers really gotten so good that humans cannot compete? I offer the following facts: High level languages like C and C++ treat the host CPU in a very generic manner. While local optimizations such as loop unrolling, and register resource contention are easy for compilers to deal with, odd features like 32 byte cache lines, 8K data/code cache totals, multiple execution units, and burst device memory interfaces are something not easily expressed or exploited by a C/C++ compiler.
On a Pentium, it is ordinarily beneficial to declare your data so that its usage on inner loops retains as much as possible in the cache for as long as possible. This can require bizarre declaration requirements which are most easily dealt with by using unions of 8K structures for all data used in your inner loops. This way you can overlap data with poor cache coherency together, while using as much of the remainder of the cache for data with good cache coherency.
The pentium also has an auxiliary floating point execution unit which can actually perform floating point operations concurrently with integer computations. This can lead to algorithmic designs which require an odd arrangement of your code, that has no sensible correspondence with high level code that computes the same thing.
Basically, on the pentium, C/C++ compilers have no easy way to translate source code to cache structured aware data and code along with concurrently running floating point. The MMX generation of x86 processors will pose even greater problems. Nevertheless I explained to the folks at Bell Labs that I owned the compiler that they suggested, and that when it came to optimizations, I could (can) easily code circles around it.
The classic example of overlooking these points above is that of one magnanimous programmer who came from a large company and declared to the world, through USENET, that he was going to write a "100% optimal 3D graphics library completely in C++". He emphatically defended his approach with long flaming postings insisting that modern C++ compilers would be able to duplicate any hand rolled assembly language optimization trick. He got most of the way through before abandoning his project. He eventually realized that the only viable solution for existing PC platforms is to exploit the potential for pipelining the FPU and the integer CPU instructions in a maximally concurrent way. Something no x86 based C compiler in production today is capable of doing.
I always roll my eyes when the old debate of whether or not the days of hand rolled assembly language are over resurfaces in the USENET assembly language newsgroups. On the other hand perhaps I should be thankful since these false beliefs about the abilities of C/C++ compilers in other programmers only, by comparison, differentiates my abilities more clearly to my employer.
The conclusion you should take away from this (and my other related web pages) is that when pedal to the metal kinds of performance is required, that there is a significant performance margin to be gained in using assembly language. Ordinarily, one combines C/C++ and assembly by using the compiler's inline assembly feature, or by linking to a library of hand rolled assembly routines.
Here's an off the cuff post from Anthony Tribelli on rec.games.programmer:
More recently, I got into a discussion with my brother about how he might optimize a totally C++ written program to improve the performance based on branching and floating point characteristics of modern x86 processors. After I did a brain dump and suggested some non-portable, assembly motivated techniques, he said he was able to use those suggestions alone to make his program run 10 times as fast!. This stuff is not trivial.
For more, see Randall Hyde's Great Debate page.