Suppose we have a problem where we want to rotate the elements of an array so
that say, the index **i** holds the value that used to be at the position
**(i+k)%n**.

To perform an in-place array rotation using a second array for temporary storage is fairly straight-forward. Just copy the array to the temporary array, then copy it back to the original array with the shifted index. However, suppose we wish to avoid needing a second array (or we wanted to avoid the doubled memory bandwidth hit from creating a second array)?

The C++ Standard Template Library contains just such an implementation. However, its complicated, computes gcds, and divisions. Is all this really necessary? The general idea is to break down the rotation into a sequence of disjoint subrotations of the indexes that form a cycle (i, i+k%n, i+2*k%n, ...). For different (n, k) pairs, there are different cycle breakdowns. For example if gcd(n,k) is 1, then there is only one cycle, but if its not 1, there is necessarily more than one cycle. The STL library implementation works out the exact parameters of these cycles before hand then rotates these cycles one by one as necessary.

In the solution I present below, I am aware of and use this same break down, but there is no need to work out the exact configurations beforehand. I simply start performing the rotation of current cycle, then when I return the to the beginning, I determine where the next adjacent cycle must be and if there is in fact an adjacent cycle of incomplete work to be done. These computations can be done on the fly fairly easily with simple if-tests that will be well predicted in real world scenarios. The resulting loop is also extremely simple, and easy to verify.

The code reads from and writes to the **a[]** array exactly once each. So
this should lead to a minimal solution with high performance.

Another idea that is commonly seen in skill testing interview questions is using three reversals to perform the rotations: rev(1,k) rev(k+1,n-k) rev(1,n). This leads us to the following code:

This solution reads and writes each element twice, however, some quick performance testing on random arrays shows that this code is faster for very small sized baseTypes. The reason being that the reverse() is very good for memory access locality. However, for any larger sized baseTypes, the first code sequence given will win by the simple virtue of performing fewer operations.