Swapping variables.

Often times I see "clever" programmers recite the following C macro as a simple method for swapping variables:
#define swap(a,b) { \ (a) ^= (b); \ (b) ^= (a); \ (a) ^= (b); \ }
And if you work it all out it seems to be ok. I find this code wanting in the area of performance and clarity. But more importantly, it is flawed. swap(x,x) (i.e., aliasing the parameters) will send the value to 0 no matter what the input value x. The much clearer version of the same thing does not suffer from this problem:
#define swap(a,b) { \ int c = (a); \ (a) = (b); \ (b) = c; \ }
This method will also tend to yield high performance code. (On advanced CPU designs, the processor will automatically forward the results of the initially over constrained (a) variable, if all are registers, allowing maximal parallelism. With the first solution, forwarding cannot be used since 3 sequential, interdependent ALU accesses are required, but fewer registers are required.)
On the other hand, if the swap operation is not just on variables, but on huge blocks of data, (and there is potential acceleration from an asynchronous or vectorizable piece of hardware) and you are memory constrained obviously the first method may be more appealing. (Thanks to Christer Ericson for pointing this out.)
One must not be close minded to alternatives, however. Consider the following variation on the swap function:
#define swap(a,b) { \ (a) += (b); \ (b) = (a) - (b); \ (a) -= (b); \ }
(Again it suffers from the swap(x,x) problem, and sequential interdependent ALU accesses.) Now suppose you are asked to implement a function to compute successive terms of the fibonacci sequence (defined as fib(n+2):=fib(n+1)+fib(n), fib(1)=fib(2)=1) The inner loop is nothing but adding and swapping, so by using the swap procedure above and performing some arithmetic simplifications one ought to be able to simplify the code significantly. (On the other hand, as pointed out by a couple readers, unrolling the fibonacci loop once will yield even higher typical performance, while retaining most of the simplicity.)
Finally, for people in search of the most break neck performance, knowing thy architecture can be very helpful. For the K6 and PPro/P-II x86 CPUs, the xchg assembly language instruction is the fastest way to swap registers (but it's very slow for swapping memory locations, due to an implicit memory access lock.) So for the WATCOM C/C++ compiler we have:
int asm_swap1 (int a, int b); int asm_swap2 (void); #pragma aux asm_swap1 = \ " xchg ebx,edx " \ parm [ebx] [edx] value [ebx]; #pragma aux asm_swap2 = "" \ value [edx]; #define swap(a,b) { \ (a) = asm_swap1 ((a), (b)); \ (b) = asm_swap2 (); \ }
A similarly fast implementation exists for Gnu compilers, however achieving a similar result using Microsoft's or Borland's compilers is not so easy without introducing extra clocks that defeat the point of the exercise.

Swapping arrays

"Mark" emailed me to ask how one might extend this to swapping arrays. His starting point was:

	memcpy (t, a, size);
	memcpy (a, b, size);
	memcpy (b, t, size);

Norbert Juffa pointed out to me that if size is a runtime determined value and smaller than some threshold, then there may not be much improvement to be had. If size is a small compile time constant, then we can use a struct to let the compiler optimize this copy at compile time:

	struct arr_s {
	    char element[size];
	} t;

	                  t = *(struct arr_s *) a;
	*(struct arr_s *) a = *(struct arr_s *) b;
	*(struct arr_s *) b = t;

The compiler can assume that a, b, t are aligned to sizeof (struct arr_s) boundaries and therefore assume that there is no non-trivial aliasing and thus use a platform optimized block copy instrinsic to perform the swap.

Another concern, however, is that different types can be treated internally in your CPU with different parts of the CPU (specifically the FPU and Integer parts of your CPU can typically be very far seperated, so you should not use integer mechanisms to swap floating point values since it will be slower, for example.). If you are able to create the type wrapper as shown above, then you can likely do something like the following:

	#define swapType (type, x, y) {                         \
	    type tmp_NAMESPACE_OBFUSCATION_XJgb75b6978t = (x);  \
	    (x) = (y);                                          \
	    (y) = tmp_NAMESPACE_OBFUSCATION_XJgb75b6978t;       \
	}

So if you want to swap a fixed array of doubles:

	struct arr_ds {
	    double element[size];
	};

	swapType (struct arr_ds, *(struct arr_ds *)&a, *(struct arr_ds *)&b);

And the compiler will be able to use type specific semantics to make sure this operation is as fast as possible.

So lets take a second to discuss aliasing. Well, we won't need more than a second ... if x and y are identical, then as long as you avoid the analytical ways of swapping (i.e., using the difference of sums, or the xor trick) then any method will work. If you are concerned with performance put an if() test at the beginning. But if the target buffers are not identical but overlap then in general, there is no final representation which is guaranteed to be swapped versions of the original. The overlapped region is at the end of one buffer, and the beginning of the other -- it is trivial to come up with any old random example to see what's wrong:

	double x[2];
	double * y = &x[1];
	x[0] = 1;
	x[1] = 2;	/* x = {1,2}; */
	y[1] = 3;	/* y = {2,3}; */

	/* There's no way to make x = {2,3} and y = {1,2} */
	/* because x[1] and y[0] are aliased.             */

As you can see, its not the way in which the aliasing occurs that matter, but just the values that are in the arrays that prevent this from working. So in general, a swap operation simply cannot address any non-trvially aliased parameters.

This situation is actually a good thing in the sense that we don't have to concentrate any effort of the question of what to do about aliasing at all. So we can use a simple temporary block solution for large sizes.

	#define BLOCK_SIZE (64) /* A cache line? */

	struct arr_bs {
	    unsigned char element[BLOCK_SIZE];
	};

	void bigSwap (void * d0, void * d1, size_t sz) {
	    int i, l = sz / BLOCK_SIZE;
	    for (i=0; i < l i++) {
	        swapType (arr_bs, ((struct arr_bs *) d0)[i], 
	                          ((struct arr_bs *) d1)[i]);
	    }
	    d0 = &(((struct arr_bs *) d0)[i]);
	    d1 = &(((struct arr_bs *) d1)[i]);
	    sz %= (unsigned int) BLOCK_SIZE;
	    for (i=0; i < sz; i++) {
	        swapType (unsigned char, ((unsigned char *) d0)[i],
	                                 ((unsigned char *) d1)[i]);
            }
	}

The point being that once the arrays get large enough, paying attention to the type no longer matters, since we will be memory limited. The activity of optimization the above will be in picking the right BLOCK_SIZE -- for example on some CPU platforms it is a good idea to pick it with an eye on cache line sizes, however, on other CPUs its possible that the temporary can be allocated totally in registers.

On a related subject ...

If you want to toggle a variable between two values you might write something like:

    if (r == value1) 
        r = value2;
    else
        r = value1;

but modern CPUs tend to be slow at conditional branching that is hard to predict and you are likely to be better served by:

    r = (value1 + value2) - r;

    r = r ^ (value1 ^ value2);

Both of the above require correct initialization of the variable in question. On most machines the performance of each should be the same, however on inherently in place CPUs such as the 80x86 family of processors, the second will be an instruction shorter for integer variable as well as being faster (the opposite is clearly true for floating point.) Notice the similarity between the swap and toggle problems.