What A Stray Mind Coughed Up: 2015

Many articles describe hardware-accelerated AES, which has been available for a few years now. SHA instructions are not out yet, but they should appear in the ARM and x86 instruction sets sometime this year, so I figure it's time to start talking about them. The first and foremost standard for the SHA-1 algorithm is NIST's current FIPS 180 document, specifically section 6.1.2 "SHA-1 Hash Computation", step 3, which says:

T = ROTL₅(a) + f_t(b, c, d) + e + K_t + W_t

This formula (which I call the T-formula) is probably the most computationally intensive part of SHA-1. One of the most surprising findings of my research into the ARM NEON and x86 SHA instruction sets is that neither of them implements this formula exactly as written.
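For reference, here is a minimal scalar sketch of one SHA-1 round built around the T-formula. It is plain Python rather than anything either instruction set provides; the names `rotl`, `f`, `K`, and `sha1_round` are my own labels for illustration, and the constants and round functions are the ones given in FIPS 180.

```python
MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)  # K_t, one per 20-round stage

def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):
    """Round function f_t: Ch, Parity, Maj, Parity for the four stages."""
    if t < 20:
        return (b & c) | (~b & d & MASK)
    if 40 <= t < 60:
        return (b & c) | (b & d) | (c & d)
    return b ^ c ^ d

def sha1_round(t, state, w_t):
    """Apply round t to (a, b, c, d, e) using message word w_t."""
    a, b, c, d, e = state
    T = (rotl(a, 5) + f(t, b, c, d) + e + K[t // 20] + w_t) & MASK
    # The old a, b, c, d shift down one slot; b is rotated left by 30 on the way.
    return (T, a, rotl(b, 30), c, d)

H0 = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
# Round 0 of the one-block message "abc", whose first schedule word is 0x61626380.
print(["%08X" % x for x in sha1_round(0, H0, 0x61626380)])
```

Every term of the T-formula is a 32-bit addition, so leaving out "e" or "K_t" in hardware just means the caller has to add that term somewhere else before or after the instruction runs.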

SHA-1 Instructions

ARM	x86	Comment
`sha1h a`		Rotate left 30
`sha1su0 a, b, c`	`sha1msg1 a, b`	Message schedule part 1
`sha1su1 a, b`	`sha1msg2 a, b`	Message schedule part 2
`sha1c hash, e, msg`	`sha1rnds4 hash, msg, 0`	Rounds 0 to 19
`sha1p hash, e, msg`	`sha1rnds4 hash, msg, 1`	Rounds 20 to 39
`sha1m hash, e, msg`	`sha1rnds4 hash, msg, 2`	Rounds 40 to 59
`sha1p hash, e, msg`	`sha1rnds4 hash, msg, 3`	Rounds 60 to 79
	`sha1nexte hash, msg`	Invisible e operand

As you can see, ARM and x86 are a bit different. From my research, it seems that ARM omits "K_t" from the T-formula, and x86 omits "e". These implementation choices imply that SHA-1 optimizations on each architecture will use these instructions very differently.

The x86 SHA instruction set extensions are documented in a publication, and if anything in this article conflicts with it, the final word is there. Because x86 omits "e" from the T-formula, e must be supplied with the sha1nexte instruction. In pseudo-code, a representative sample of 4 SHA-1 rounds on x86 can be computed by:

work18 = sha1msg2(sha1msg1(work14, work15) ^ work16, work17);
hash19 = sha1rnds4(hash18, sha1nexte(hash17, work18), 3);
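To check that this data flow really adds up to SHA-1, here is a small Python emulation. It is a sketch under assumptions: the four functions borrow the instruction names but only model the arithmetic implied by the pseudo-code above, with words in natural order, and make no attempt to match the real register lane layout or operand encoding. `compress` and `sha1` are hypothetical helpers of mine.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)

def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def f(func, b, c, d):
    if func == 0:
        return (b & c) | (~b & d & MASK)    # Ch
    if func == 2:
        return (b & c) | (b & d) | (c & d)  # Maj
    return b ^ c ^ d                        # Parity

def sha1msg1(a, b):
    # W[t-16] ^ W[t-14] for four consecutive t
    return [a[0] ^ a[2], a[1] ^ a[3], a[2] ^ b[0], a[3] ^ b[1]]

def sha1msg2(x, d):
    # Fold in W[t-3] and rotate; the last word needs the first result.
    r0 = rotl(x[0] ^ d[1], 1)
    return [r0, rotl(x[1] ^ d[2], 1), rotl(x[2] ^ d[3], 1), rotl(x[3] ^ r0, 1)]

def sha1nexte(prev_hash, work):
    # e is ROTL30 of the a word from four rounds earlier; fold it into work[0].
    return [(work[0] + rotl(prev_hash[0], 30)) & MASK] + list(work[1:])

def sha1rnds4(abcd, work_e, func):
    a, b, c, d = abcd
    e = 0  # the real e was already added into work_e[0]
    for w in work_e:
        t = (rotl(a, 5) + f(func, b, c, d) + e + K[func] + w) & MASK
        a, b, c, d, e = t, a, rotl(b, 30), c, d
    return [a, b, c, d]

def compress(h, block):
    w = list(struct.unpack(">16I", block))
    work = [w[i:i + 4] for i in range(0, 16, 4)]
    abcd, prev = list(h[:4]), None
    for i in range(20):  # 20 groups of 4 rounds
        if i >= 4:
            x = [p ^ q for p, q in zip(sha1msg1(work[i - 4], work[i - 3]), work[i - 2])]
            work.append(sha1msg2(x, work[i - 1]))
        if i == 0:
            work_e = [(work[0][0] + h[4]) & MASK] + work[0][1:]  # initial e
        else:
            work_e = sha1nexte(prev, work[i])
        prev, abcd = abcd, sha1rnds4(abcd, work_e, i // 5)
    e = rotl(prev[0], 30)
    return [(x + y) & MASK for x, y in zip(h, abcd + [e])]

def sha1(data):
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    data += b"\x80" + b"\x00" * ((55 - len(data)) % 64) + struct.pack(">Q", 8 * len(data))
    for off in range(0, len(data), 64):
        h = compress(h, data[off:off + 64])
    return struct.pack(">5I", *h).hex()

assert sha1(b"abc") == hashlib.sha1(b"abc").hexdigest()
```

The point of the sketch is the bookkeeping: the round function never sees e as a separate operand, so the caller has to keep the previous group's hash around just to derive it.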

ARM, on the other hand, has absolutely no documentation about its SHA instruction set, except how to determine whether the processor supports it. While this is handy, it does very little to describe the inner workings of each instruction. So the following is just a guess on my part: given that sha1c, sha1p, and sha1m each take two vectors as input (plus e), there is enough information to compute 4 rounds of SHA-1, just like the x86 sha1rnds4 instruction. Because ARM omits "K_t" from the T-formula, K_t must be added to the work vector. In pseudo-code, a representative sample of 4 SHA-1 rounds on ARM can be computed by:

work18 = sha1su1(sha1su0(work14, work15, work16), work17);
hash19 = sha1p(hash18, sha1h(hash17[0]), work18 + SHA1_K3V);

where SHA1_K3V is a vector of four uint32 values, each equal to SHA1_K3.
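The same kind of sanity check can be done for the ARM-style flow. Again this is only a sketch built on my guess above: the functions reuse the instruction names but model nothing beyond the arithmetic in the pseudo-code, with words in natural order, and `rounds4`, `compress`, and `sha1` are hypothetical helpers. Note how K is added to the work vector by the caller instead of inside the round function.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
SHA1_K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)

def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1h(a):
    return rotl(a, 30)

def sha1su0(a, b, c):
    # W[t-16] ^ W[t-14] ^ W[t-8] for four consecutive t
    return [a[0] ^ a[2] ^ c[0], a[1] ^ a[3] ^ c[1],
            a[2] ^ b[0] ^ c[2], a[3] ^ b[1] ^ c[3]]

def sha1su1(x, d):
    # Fold in W[t-3] and rotate; the last word needs the first result.
    r0 = rotl(x[0] ^ d[1], 1)
    return [r0, rotl(x[1] ^ d[2], 1), rotl(x[2] ^ d[3], 1), rotl(x[3] ^ r0, 1)]

def rounds4(f, abcd, e, wk):
    # Four rounds of the T-formula with K_t already folded into wk.
    a, b, c, d = abcd
    for x in wk:
        t = (rotl(a, 5) + f(b, c, d) + e + x) & MASK
        a, b, c, d, e = t, a, rotl(b, 30), c, d
    return [a, b, c, d]

def sha1c(abcd, e, wk):
    return rounds4(lambda b, c, d: (b & c) | (~b & d & MASK), abcd, e, wk)

def sha1p(abcd, e, wk):
    return rounds4(lambda b, c, d: b ^ c ^ d, abcd, e, wk)

def sha1m(abcd, e, wk):
    return rounds4(lambda b, c, d: (b & c) | (b & d) | (c & d), abcd, e, wk)

STAGES = (sha1c, sha1p, sha1m, sha1p)

def compress(h, block):
    w = list(struct.unpack(">16I", block))
    work = [w[i:i + 4] for i in range(0, 16, 4)]
    abcd, e = list(h[:4]), h[4]
    for i in range(20):  # 20 groups of 4 rounds
        if i >= 4:
            work.append(sha1su1(sha1su0(work[i - 4], work[i - 3], work[i - 2]), work[i - 1]))
        wk = [(x + SHA1_K[i // 5]) & MASK for x in work[i]]  # work + SHA1_KnV
        next_e = sha1h(abcd[0])  # becomes e for the following group
        abcd = STAGES[i // 5](abcd, e, wk)
        e = next_e
    return [(x + y) & MASK for x, y in zip(h, abcd + [e])]

def sha1(data):
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    data += b"\x80" + b"\x00" * ((55 - len(data)) % 64) + struct.pack(">Q", 8 * len(data))
    for off in range(0, len(data), 64):
        h = compress(h, data[off:off + 64])
    return struct.pack(">5I", *h).hex()

assert sha1(b"abc") == hashlib.sha1(b"abc").hexdigest()
```

Compared with the x86-style flow, e is passed explicitly here, so the only extra work per group is one vector add for K and one sha1h.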

For more information see also: sha1.rs


Tuesday, January 27, 2015

Hardware accelerated SHA-1
