AltiVec Optimization for Skein

Skein is hash function. See http://skein-hash.info for more information.

This is an optimized version of the Skein routine for 32bit PowerPC processors that contain an AltiVec unit (also refered to as VMX or Velocity Engine).

As Skein makes extensive use of 64bit operations, these need to be emulated using the 128bit vectors that contain four 32bit words. Unfortunately this causes a performance penalty for the algorithm. Another disadvantage for the PPC implementation is that Skein uses little endian words, so the data needs to be converted to big endian for the calculations.

All flavours, 256, 512 and 1024 bit are supported. Different implementations are used for rotations depending on which was fastest.

Getting It

You can get the altivec implementation using git or browse the code.

git clone https://git.sipsolutions.net/skein-altivec.git

The actual implementation is in skein_block.c

Speed

In performance tests on a PowerBook G4 1.6 GHz (7447A), the code (optimized for the G4) achieves a hash performance of about 35 MByte/s, or about 45 cycles/byte. This is more than 10 times faster than the simple non-optimized code that GCC creates.

The 256 and 1024 bit implementations are both slightly slower than the 512 bit implementation.

Implementation Details

(Just copied from skein_block.c.)

Altivec has 32 128 bit registers, which can be used as 8, 16 or 32 bit integers, or 32 bit floating point values. Altivec does not support 64bit operations natively, but we can still do two 64 bit calculations at the same time with some tricks.

The important part here is that the state has two different kinds of words. There are eight 64 bit values, but 4 (the even ones) are always used as the target for the addition, and the other 4 (the odd ones) are the target of the xor (and are shifted).

Now this does not lend to the way that altivec works, because an even and odd value would end up in the same register, and we cannot perform the same operation on both words. However, we can change the order (Brackets show which values are together in a vector.):

Instead of:

(0, 1), (2, 3), (4, 5), (6, 7)

we use:

(0, 2), (4, 6), (1, 3), (5, 7)

Now the old algorithm:

0 = 0 + 1;
2 = 2 + 3;
rotate(1, a);
rotate(3, b);
1 = 0 ^ 1
3 = 2 ^ 3

can be replaced with:

v0 = v0 + v2;
rotate(v2, a, b);
v0 = v0 ^ v2;

If we now look at the different steps, we notice that the calculations always look the same. The words in the first two vectors are always the destination of the addition, and the words of the last two vectors the destination for the xor and rotate. The difference between each run is that the words in the last two vectors are swapped.

So in the first round the vectors are:

(0, 2), (4, 6), (1, 3), (5, 7)

second:

(0, 2), (4, 6), (3, 1), (7, 5)

third:

(0, 2), (4, 6), (5, 7), (1, 3)

fourth:

(0, 2), (4, 6), (7, 5), (3, 1)

key injection (which needs to be adjusted because of the different order):

(0, 2), (4, 6), (1, 3), (5, 7)

As one can see, it is only neccessary to permute the values in the last two vectors. After that the same code can be run again for the next round.