Paul Hsieh's Block Copy Analysis Page

Optimized Block memory transfers on the Pentium

Background

So how do you do it?

Background

This means that there are four things that determine peak memory bandwidth performance:

(1) Your CPU: how fast can it issue reads and writes?
(2) Your chipset: how efficiently does it schedule reads/writes?
(3) The PCI bus: 33 MHz bottleneck + burst constraints (if device memory is used).
(4) Your RAM speed: DRAM vs. EDO RAM vs. SDRAM, etc.

It turns out that, for CPUs up to at least the 166 MHz Pentium, the Triton II chipset + PCI bus performs no differently at burst rates than the Triton II chipset alone for writing. That is to say, sequential, aligned writes to device memory, such as graphics memory, are no slower than writes to system memory. The reason is that writes to system memory must flush out of the cache roughly as fast as they are put there; unlike reads, writes have to be flushed out of the cache to memory. Furthermore, the device memory on modern graphics cards is fast enough to receive PCI burst rate transfers.

By contrast, reads from device memory are significantly slower than system memory reads (somewhere in the 5 to 20 times ballpark). Here the CPU is compensating more than anything else, thanks to its ability to cache memory reads. In fact, system memory reads are significantly faster (50%-100% faster) than memory writes, which are in turn significantly faster than device memory reads. Also, since system memory reads are cached, all their bottlenecks are in the CPU, and the CPU can deal with unaligned data transfers just as efficiently as aligned ones.

When doing a memory to memory transfer, if the source memory is cached system memory, the chipset is also able to pipeline the memory accesses to a large extent. In fact, system memory to memory transfers were only 50% slower than straight arbitrary memory writes in experiments I ran.

So the PCI bus is probably ok, for now (so long as nobody comes up with anything better than Intel's Triton chipset.) However, faster CPUs and faster memories will clearly start to expose the PCI bus as a bottleneck (33 MHz vs. CPU speeds of 200+ MHz?) Intel plans to deal with this by eventually moving people to a 66 MHz version of the PCI bus. They will leverage the AGP architecture to offset the backlash from the incompatibilities that the move to a 66 MHz PCI bus will cause. Older buses, such as the ISA bus, allow 8.3 MHz transfers of 16 bits per transaction (clearly a bottleneck.)

So how do you do it?

While sequential memory access is a simple concept to deal with, alignment is not always so simple (in the case of graphics, the target for a set of "pixels" is not ordinarily aligned.) So this all boils down to: How does one decompose a memory transfer operation into a maximally sequential and aligned set of sub-block copy operations?

The answer, I claim, is simply to do a small number of byte writes at the beginning and at the end, leaving a maximal sequential middle whose destination is dword aligned. The following code is my attempt to do just that; it is plug-in substitutable for "rep movsb" in most cases. (It works, as is, for the WATCOM C/C++ compiler.)

void BlockCopy(char * Src, char * Dest, unsigned int Len);
#pragma aux BlockCopy =              \
"           mov eax, ecx           " \
"           sub ecx, edi           " \
"           sub ecx, eax           " \
"           and ecx, 3             " \
"           sub eax, ecx           " \
"           jle short LEndBytes    " \
"           rep movsb              " \
"           mov ecx, eax           " \
"           and eax, 3             " \
"           shr ecx, 2             " \
"           rep movsd              " \
"LEndBytes: add ecx, eax           " \
"           rep movsb              " \
parm [ESI] [EDI] [ECX]               \
modify [EAX ECX ESI EDI];

It has the nice feature of having only one jump which, in performance situations, you would expect to be rarely taken (hence well predicted; see Agner Fog's Pentium optimization article.) Note that this is not a suitable implementation of "memmove()" (see case #1)
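For readers without the WATCOM toolchain, the same head/middle/tail decomposition can be sketched in portable C. This is only an illustration under stated assumptions, not the routine above: the name block_copy_sketch is hypothetical, a memcpy of a uint32_t stands in for movsd, and nothing here says how a given compiler will schedule it. Like the assembly, it copies forward only and is not a memmove() replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable analogue of BlockCopy: byte head, dword middle,
   byte tail. Forward copy only; overlapping buffers are not handled. */
void block_copy_sketch(const unsigned char *src, unsigned char *dest, size_t len)
{
    assert(src != NULL && dest != NULL);

    /* Bytes needed to bring dest up to a 4-byte boundary: (-dest) & 3. */
    size_t head = (size_t)(0u - (uintptr_t)dest) & 3u;

    if (len <= head) {
        /* Too short to reach an aligned dword: plain byte copy
           (the "jle short LEndBytes" path). */
        while (len--) *dest++ = *src++;
        return;
    }

    /* Head: byte writes until dest is dword aligned. */
    len -= head;
    while (head--) *dest++ = *src++;

    /* Middle: sequential dword transfers to an aligned destination. */
    size_t dwords = len >> 2;
    size_t tail = len & 3u;
    while (dwords--) {
        uint32_t w;
        memcpy(&w, src, sizeof w);   /* source may be unaligned */
        memcpy(dest, &w, sizeof w);  /* dest is 4-byte aligned here */
        src += 4;
        dest += 4;
    }

    /* Tail: the 0..3 bytes left over. */
    while (tail--) *dest++ = *src++;
}
```

A call such as block_copy_sketch(src, dest, n) copies n bytes from src to dest; only the destination alignment drives the split, which matches the observation above that unaligned cached reads cost the CPU nothing extra.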

Ken Silverman, who studied my loop, came up with a very clever simplification which (for reasons that are obvious from the discussion below) probably doesn't lead to an appreciable performance improvement, but is shorter.

void BlockCopy(char * Src, char * Dest, unsigned int Len);
#pragma aux BlockCopy =              \
"           lea ecx, [edi+edi*2]   " \
"           and ecx, 3             " \
"           sub eax, ecx           " \
"           jle short LEndBytes    " \
"           rep movsb              " \
"           mov ecx, eax           " \
"           and eax, 3             " \
"           shr ecx, 2             " \
"           rep movsd              " \
"LEndBytes: add ecx, eax           " \
"           rep movsb              " \
parm [ESI] [EDI] [EAX]               \
modify exact [EAX ECX ESI EDI];

It uses the very clever observation that 3 == -1 (mod 4), which means that multiplying by 3 is the same as negating before an and with 3. (Looking at things mathematically can be useful!)
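To spell the identity out: 3x + x = 4x, which is 0 (mod 4), so 3x == -x (mod 4), and the low two bits of edi+edi*2 equal the low two bits of -edi. A small C check of that claim follows; the function names are hypothetical and 32-bit unsigned arithmetic is assumed to stand in for the registers.

```c
#include <assert.h>
#include <stdint.h>

/* Head length as the first listing computes it: negate, then mask. */
uint32_t head_by_negation(uint32_t addr)
{
    return (0u - addr) & 3u;
}

/* Head length as the lea version computes it: addr + addr*2, then mask. */
uint32_t head_by_times_three(uint32_t addr)
{
    return (addr + addr * 2u) & 3u;
}

/* Returns 1 if both forms agree for every address in [first, first + count). */
int head_forms_agree(uint32_t first, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        uint32_t a = first + i;
        if (head_by_negation(a) != head_by_times_three(a))
            return 0;
    }
    return 1;
}

/* Self-check over a small range, including wraparound near the top of
   the 32-bit address space. */
void check_head_identity(void)
{
    assert(head_forms_agree(0u, 4096u));
    assert(head_forms_agree(0xFFFFF000u, 4096u));
}
```

Only the low two bits matter after the mask, so overflow in addr * 2 is harmless.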

On a P5-166, I attempted to implement this idea using various "RISCified" copy loops, but nothing yielded better performance (on a 386, rep movsd and rep movsw should be fastest due to instruction prefetch buffer considerations.) Nor were they worse, which indicates to me that the memory bottleneck is really the only consideration. I would thus conclude that techniques such as "compiled bitmaps" would similarly yield no performance improvement (since the instructions must be read into the cache, the same way an ordinary loop loads its source.) Intel and others suggest that using the FPU to issue 64-bit moves directly would be even faster. I have only very recently tested this theory, and have indeed confirmed that using an unrolled QWORD based fild/fistp improves performance for HOST->HOST based transfers by a maximum of about 15% on a P5-166 (stop press! Pentiums with MMX showed an improvement of about 30% using FPU based 64-bit copies.) Charlie Wallace has made some sample code for various block copy/clears available, including some FPU based routines.

(Update: Intel has since updated and improved its chipset, and also supports a feature called "write-combining" in its P-II processors, which makes nearly all copy loops run with the same peak performance. However, for K6 microprocessors, it turns out that using MMX to move data 64 bits at a time is the fastest way to perform a block copy. Therefore, I would claim that using MMX, if it's available, is the fastest way; otherwise, if you are using an Intel CPU, use fild/fistp on QWORDs, and in all other cases use MOVSD as shown in the above examples.)

A common rendering technique is to perform all drawing in a memory buffer, then transfer the buffer to the destination graphics memory. (Under Windows, this would be the "CreateDIBSection" or "WinG" method, where device compatible DIBs are rendered to, then copied to the screen.) This is all based on the old assumption that writes to graphics memory are significantly slower than writes to system memory. But, as described above, for PCI based solutions with reasonable device memory systems (a good graphics card) this simply isn't true. Thus, the triumph of Direct Draw over WinG. (Direct Draw lets you write directly to the device memory, or use graphics accelerator commands to draw.)

I should comment that those considering writing an optimized host based graphics library based on these ideas should first consider using Graphics Accelerators (and see how they are programmed.)

Update: Certain flaws in Intel's memory bus have since come to light that may confuse matters still more. See this web page about it. I believe these anomalies were first noticed by Vesa Karvonen, who discussed similar unusual results with me some time ago (I didn't know what to make of them, since I was not able to reproduce his results on my very old Pentium, which was probably more bottlenecked by an inability to burst at all.)

Update (12/04/99): Chipsets and CPUs these days are somewhat improved -- a preliminary improvement comes from moving to MMX:

// WATCOM C/C++ v11.0 required.
void BlockCopy(char * Src, char * Dest, unsigned int Len);
#pragma aux BlockCopy =              \
"           mov eax, ecx           " \
"           sub ecx, edi           " \
"           sub ecx, eax           " \
"           and ecx, 7             " \
"           sub eax, ecx           " \
"           jle short LEndBytes    " \
"           emms                   " \
"           rep movsb              " \
"           mov ecx, eax           " \
"           and eax, 7             " \
"           shr ecx, 3             " \
"           jz LEndBytes           " \
"           sub edi, esi           " \
"l1:        movq mm0, [esi]        " \
"           movq [edi+esi], mm0    " \
"           add esi, 8             " \
"           dec ecx                " \
"           jnz l1                 " \
"           add edi, esi           " \
"           emms                   " \
"LEndBytes: add ecx, eax           " \
"           rep movsb              " \
parm [ESI] [EDI] [ECX]               \
modify [EAX ECX ESI EDI];

The problem here, though, is that we are being betrayed by decode bandwidth, so the loop would have to be unrolled further to get the best performance. However, as the complexity increases, its utility for lower counts decreases.

Intel claims that even further performance increases are possible by using (SSE based) prefetch instructions; however, since the load and store bandwidth on the P6 core are basically the same, I don't quite see what they are driving at. I would suspect that an Athlon, which can perform two loads but only one store per clock, would be a much better candidate for using (3DNow! based) prefetch instructions.

Update (9/11/03): For the Athlon XP processor, using the xmm registers (SSE) leads to a dramatic performance improvement either in or out of the cache (don't ask me why/how):

           mov edx, edi
           mov eax, ecx
           sub ecx, edi
           sub ecx, eax
           and ecx, 15
           sub eax, ecx
           jle short LEndBytes
           rep movsb
           mov ecx, eax
           and eax, 15
           shr ecx, 4
           jz LEndBytes
           sub edi, esi
l1:        movups xmm0, [esi]
           movntps [edi+esi], xmm0
           add esi, 16
           dec ecx
           jnz l1
           add edi, esi
LEndBytes: add ecx, eax
           rep movsb

Oddly, it appears as though unrolling doesn't help.
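For compilers that offer SSE intrinsics rather than inline assembly, the loop above might be sketched roughly as follows. This is an approximation and not a translation I have measured: block_copy_sse_sketch is a hypothetical name, memcpy stands in for the rep movsb head and tail, the trailing _mm_sfence() is my addition (non-temporal stores are weakly ordered) and does not appear in the listing, and builds without SSE simply fall back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define BLOCK_COPY_HAVE_SSE 1
#else
#define BLOCK_COPY_HAVE_SSE 0
#endif

/* Hypothetical intrinsics sketch: unaligned 16-byte loads (movups) paired
   with non-temporal 16-byte stores (movntps) to an aligned destination. */
void block_copy_sse_sketch(const unsigned char *src, unsigned char *dest, size_t len)
{
    assert(src != NULL && dest != NULL);
#if BLOCK_COPY_HAVE_SSE
    /* Bytes needed to bring dest up to a 16-byte boundary: (-dest) & 15. */
    size_t head = (size_t)(0u - (uintptr_t)dest) & 15u;

    if (len <= head) {
        /* Too short to reach an aligned 16-byte block. */
        memcpy(dest, src, len);
        return;
    }

    /* Head: align the destination (rep movsb in the listing). */
    memcpy(dest, src, head);
    src += head;
    dest += head;
    len -= head;

    /* Middle: one 16-byte block per iteration, not unrolled. */
    size_t blocks = len >> 4;
    while (blocks--) {
        __m128 v = _mm_loadu_ps((const float *)(const void *)src);
        _mm_stream_ps((float *)(void *)dest, v);  /* needs aligned dest */
        src += 16;
        dest += 16;
    }

    /* Assumption, not in the listing: order the streaming stores before
       any later stores become visible to other processors. */
    _mm_sfence();

    /* Tail: the 0..15 bytes left over. */
    memcpy(dest, src, len & 15u);
#else
    /* No SSE available in this build: plain copy. */
    memcpy(dest, src, len);
#endif
}
```

The loads and stores move the bits untouched, so treating the bytes as floats here is only a matter of which move instruction gets emitted.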

It also appears as though AMD has written their own memcpy routine.

I have written a simple test harness in which I have tested a couple of memcpy loops.