aes: support aes_armv8 on Rust 1.61+ using asm! #365

Merged on Jun 17, 2023 (4 commits)

Conversation

tarcieri
Member

@tarcieri tarcieri commented Jun 15, 2023

Adds "polyfills" for the unstable ARMv8 AES intrinsics using the asm! macro, which was stabilized in Rust 1.59. Note, however, that we also need the target_feature stabilizations for aes and neon, which occurred in Rust 1.61.

Based on benchmarks this has no effect on performance, although it was necessary to place AESE/AESMC and AESD/AESIMC into a single asm! block in order to ensure that instructions fuse properly, as they did when using the proper intrinsics.
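A minimal sketch of what such a fused polyfill might look like, for readers following along. The function name vaeseq_u8_and_vaesmcq_u8 comes from later in this thread; the module name `polyfill` and the safe wrapper `aese_aesmc` are hypothetical, and this is not necessarily the exact code that was merged. Both instructions sit in one asm! block so the compiler cannot schedule anything between them.

```rust
#[cfg(target_arch = "aarch64")]
mod polyfill {
    use core::arch::{aarch64::*, asm};

    /// AESE followed by AESMC, emitted as one asm! block so the pair stays adjacent.
    #[inline]
    #[target_feature(enable = "aes")]
    pub unsafe fn vaeseq_u8_and_vaesmcq_u8(mut data: uint8x16_t, key: uint8x16_t) -> uint8x16_t {
        asm!(
            "AESE {d:v}.16B, {k:v}.16B",
            "AESMC {d:v}.16B, {d:v}.16B",
            d = inout(vreg) data,
            k = in(vreg) key,
            options(pure, nomem, nostack, preserves_flags)
        );
        data
    }
}

/// Hypothetical safe wrapper: returns None when the CPU lacks the AES extension.
#[cfg(target_arch = "aarch64")]
pub fn aese_aesmc(block: [u8; 16], key: [u8; 16]) -> Option<[u8; 16]> {
    use core::arch::aarch64::*;
    if !std::arch::is_aarch64_feature_detected!("aes") {
        return None;
    }
    let mut out = [0u8; 16];
    unsafe {
        let d = polyfill::vaeseq_u8_and_vaesmcq_u8(
            vld1q_u8(block.as_ptr()),
            vld1q_u8(key.as_ptr()),
        );
        vst1q_u8(out.as_mut_ptr(), d);
    }
    Some(out)
}

/// On other architectures the sketch simply reports "unsupported".
#[cfg(not(target_arch = "aarch64"))]
pub fn aese_aesmc(_block: [u8; 16], _key: [u8; 16]) -> Option<[u8; 16]> {
    None
}
```

On hardware with the AES extension, an all-zero block and key should yield sixteen 0x63 bytes: SubBytes maps 0x00 to 0x63, and ShiftRows and MixColumns leave a uniform state unchanged.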

In the next breaking release, we should be able to get rid of the aes_armv8 configuration parameter entirely, bumping MSRV to 1.61 (the release that stabilized the required target features), and then ARMv8 support should Just Work(TM) where available.
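For context, a configuration parameter like aes_armv8 is a custom cfg passed via RUSTFLAGS, as in the benchmark commands in this thread. A rough sketch of how such gating could look, assuming a hypothetical `backend_name` function (the crate's real selection logic may differ):

```rust
// Selected only when building for AArch64 with RUSTFLAGS="--cfg aes_armv8".
#[cfg(all(target_arch = "aarch64", aes_armv8))]
pub fn backend_name() -> &'static str {
    "armv8"
}

// Everything else falls back to a portable software implementation.
#[cfg(not(all(target_arch = "aarch64", aes_armv8)))]
pub fn backend_name() -> &'static str {
    "soft"
}
```

Dropping the parameter would amount to removing the `aes_armv8` term from both conditions, leaving the choice to the target architecture and feature detection.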

Performance appears to be unchanged.

Benchmarks (M1 Max)

Before

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 2.24s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-387e2e27bf9fddbd)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-8400f88c92c19848)

running 15 tests
test aes128_decrypt_block  ... bench:       1,059 ns/iter (+/- 8) = 15471 MB/s
test aes128_decrypt_blocks ... bench:       1,189 ns/iter (+/- 17) = 13779 MB/s
test aes128_encrypt_block  ... bench:       1,056 ns/iter (+/- 31) = 15515 MB/s
test aes128_encrypt_blocks ... bench:       1,192 ns/iter (+/- 86) = 13744 MB/s
test aes128_new            ... bench:         102 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,510 ns/iter (+/- 10) = 10850 MB/s
test aes192_decrypt_blocks ... bench:       1,272 ns/iter (+/- 84) = 12880 MB/s
test aes192_encrypt_block  ... bench:       1,448 ns/iter (+/- 22) = 11314 MB/s
test aes192_encrypt_blocks ... bench:       1,276 ns/iter (+/- 24) = 12840 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 1)
test aes256_decrypt_block  ... bench:       1,728 ns/iter (+/- 31) = 9481 MB/s
test aes256_decrypt_blocks ... bench:       1,545 ns/iter (+/- 10) = 10604 MB/s
test aes256_encrypt_block  ... bench:       1,727 ns/iter (+/- 14) = 9486 MB/s
test aes256_encrypt_blocks ... bench:       1,547 ns/iter (+/- 23) = 10590 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 24.93s

After

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 1.07s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-eb55ebb5f7ca4ca0)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-6305260e7c7483e7)

running 15 tests
test aes128_decrypt_block  ... bench:       1,060 ns/iter (+/- 7) = 15456 MB/s
test aes128_decrypt_blocks ... bench:       1,025 ns/iter (+/- 13) = 15984 MB/s
test aes128_encrypt_block  ... bench:       1,059 ns/iter (+/- 7) = 15471 MB/s
test aes128_encrypt_blocks ... bench:       1,024 ns/iter (+/- 8) = 16000 MB/s
test aes128_new            ... bench:         101 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:       1,446 ns/iter (+/- 16) = 11330 MB/s
test aes192_decrypt_blocks ... bench:       1,271 ns/iter (+/- 25) = 12890 MB/s
test aes192_encrypt_block  ... bench:       1,447 ns/iter (+/- 18) = 11322 MB/s
test aes192_encrypt_blocks ... bench:       1,275 ns/iter (+/- 8) = 12850 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 5)
test aes256_decrypt_block  ... bench:       1,833 ns/iter (+/- 17) = 8938 MB/s
test aes256_decrypt_blocks ... bench:       1,569 ns/iter (+/- 9) = 10442 MB/s
test aes256_encrypt_block  ... bench:       1,727 ns/iter (+/- 19) = 9486 MB/s
test aes256_encrypt_blocks ... bench:       1,570 ns/iter (+/- 18) = 10435 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 21.59s

cc @codahale

@tarcieri
Member Author

tarcieri commented Jun 15, 2023

Oh hmm, the target features were stabilized sometime after 1.59. Guess I'll have to figure out when.

Edit: looks like Rust 1.61.

@tarcieri changed the title from "arm: support aes_armv8 on Rust 1.59+ using asm!" to "arm: support aes_armv8 on Rust 1.61+ using asm!" on Jun 15, 2023
@tarcieri
Member Author

Weird, it wouldn't compile on aarch64-unknown-linux-gnu without 10debe5 which causes a performance regression.

Let me play around with it a bit more. It's working fine on aarch64-apple-darwin and I would think #[inline(always)] would work in place of target_feature.

@tarcieri tarcieri marked this pull request as draft June 15, 2023 20:13
@newpavlov
Member

Weird, it wouldn't compile on aarch64-unknown-linux-gnu without 10debe5 which causes a performance regression.

Huh, it certainly looks weird. I don't think I've seen such behavior on x86. I would think that asm blocks should be completely opaque to the compiler (apart from register allocation stuff). Maybe ask about it on IRLO or in a new Rust repository issue?

@tarcieri
Member Author

tarcieri commented Jun 15, 2023

Yeah, seems like some sort of bug in rustc. You can see the build failure here:

https://github.com/RustCrypto/block-ciphers/actions/runs/5283017067/jobs/9558698657

Curiously it only seems to care about AESE and AESIMC (and even then, only sometimes). It doesn't seem to care about AESD and AESMC.

@tarcieri
Member Author

I'm pretty perplexed by the performance difference of 10debe5 too... it seems like unrelated code makes the performance greatly suffer

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 1.40s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-eb55ebb5f7ca4ca0)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-6305260e7c7483e7)

running 15 tests
test aes128_decrypt_block  ... bench:       1,060 ns/iter (+/- 9) = 15456 MB/s
test aes128_decrypt_blocks ... bench:       1,027 ns/iter (+/- 14) = 15953 MB/s
test aes128_encrypt_block  ... bench:       2,698 ns/iter (+/- 43) = 6072 MB/s
test aes128_encrypt_blocks ... bench:       3,606 ns/iter (+/- 170) = 4543 MB/s
test aes128_new            ... bench:         104 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:       1,447 ns/iter (+/- 14) = 11322 MB/s
test aes192_decrypt_blocks ... bench:       1,273 ns/iter (+/- 11) = 12870 MB/s
test aes192_encrypt_block  ... bench:       3,032 ns/iter (+/- 93) = 5403 MB/s
test aes192_encrypt_blocks ... bench:       3,607 ns/iter (+/- 172) = 4542 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 3)
test aes256_decrypt_block  ... bench:       1,755 ns/iter (+/- 85) = 9335 MB/s
test aes256_decrypt_blocks ... bench:       1,571 ns/iter (+/- 28) = 10429 MB/s
test aes256_encrypt_block  ... bench:       3,417 ns/iter (+/- 22) = 4794 MB/s
test aes256_encrypt_blocks ... bench:       4,148 ns/iter (+/- 268) = 3949 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 22.36s

@newpavlov
Member

newpavlov commented Jun 15, 2023

My guess is that it fails to inline the "intrinsics", despite having #[target_feature(enable = "aes")] on their users. It could be worth exploring the code generation in godbolt to find out what exactly causes the degradation.

@tarcieri
Member Author

tarcieri commented Jun 16, 2023

@newpavlov what's really weird is that the performance of vaeseq_u8_and_vaesmcq_u8 is massively degraded, but it doesn't call either vaeseq_u8 or vaesimcq_u8, it has its own separate block of assembly.

Replacing #[inline(always)] with #[target_feature(enable = "aes")] is causing a performance degradation on seemingly unrelated code. Perhaps you're right that it's breaking inlining somehow, but the functions are only related by their contents (i.e. the same ASM instructions), not a direct call path.

@tarcieri
Member Author

I'm also having trouble extracting a minimal repro. Just encrypt1/encrypt8/decrypt1/decrypt8 and the intrinsics seem to be fine, so I guess it's something with the key expansion.

@tarcieri
Member Author

I opened a rustc issue here: rust-lang/rust#112709

This seems to fix the build failures we were experiencing here:

rust-lang/rust#112709
@tarcieri tarcieri marked this pull request as ready for review June 16, 2023 18:04
@tarcieri
Member Author

Okay, it seems there were some callers in the key expansion code which weren't properly annotated with target_feature(enable = "aes") which were the culprit: c7fe62e

The OS-specific discrepancies were due to target_feature=+aes being enabled on certain targets but not others, with the error occurring where it wasn't enabled.

With that new commit we're green on all targets, and I've confirmed performance is not impacted, so this is ready for review.

@tarcieri
Member Author

tarcieri commented Jun 16, 2023

@newpavlov hmm, perhaps we should use #[inline] + #[target_feature] instead of #[inline(always)] to avoid these hard-to-debug errors: rust-lang/stdarch#306

Edit: went ahead and did that in 26a070d
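To illustrate the pattern being discussed, here is a hedged sketch in which both the asm!-based polyfill and its caller carry #[inline] plus #[target_feature(enable = "aes")]. The names `sub_word` and `sub_word_checked` are hypothetical stand-ins for a key-expansion helper and a safe entry point; the merged code may be structured differently.

```rust
#[cfg(target_arch = "aarch64")]
mod armv8 {
    use core::arch::{aarch64::*, asm};

    /// Polyfill for the AESE intrinsic.
    #[inline]
    #[target_feature(enable = "aes")]
    pub unsafe fn vaeseq_u8(mut data: uint8x16_t, key: uint8x16_t) -> uint8x16_t {
        asm!(
            "AESE {d:v}.16B, {k:v}.16B",
            d = inout(vreg) data,
            k = in(vreg) key,
            options(pure, nomem, nostack, preserves_flags)
        );
        data
    }

    /// The caller is annotated too, so the feature set matches along the call path.
    /// AESE with an all-zero key on a broadcast word applies the S-box to each byte.
    #[inline]
    #[target_feature(enable = "aes")]
    pub unsafe fn sub_word(input: u32) -> u32 {
        let input = vreinterpretq_u8_u32(vdupq_n_u32(input));
        let sub = vaeseq_u8(input, vdupq_n_u8(0));
        vgetq_lane_u32::<0>(vreinterpretq_u32_u8(sub))
    }
}

/// Safe boundary: the runtime check happens once, outside the annotated functions.
#[cfg(target_arch = "aarch64")]
pub fn sub_word_checked(input: u32) -> Option<u32> {
    if std::arch::is_aarch64_feature_detected!("aes") {
        Some(unsafe { armv8::sub_word(input) })
    } else {
        None
    }
}

/// Other architectures report "unsupported" in this sketch.
#[cfg(not(target_arch = "aarch64"))]
pub fn sub_word_checked(_input: u32) -> Option<u32> {
    None
}
```

Because every lane holds the same word, ShiftRows has no visible effect, so on supporting hardware sub_word_checked(0x00010203) should return Some(0x637c777b), the S-box images of 0x00, 0x01, 0x02 and 0x03.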

@tarcieri changed the title from "arm: support aes_armv8 on Rust 1.61+ using asm!" to "aes: support aes_armv8 on Rust 1.61+ using asm!" on Jun 16, 2023
Review thread on aes/src/armv8/intrinsics.rs (resolved)
Review thread on aes/src/armv8/expand.rs (resolved)
Co-authored-by: Taiki Endo <te316e89@gmail.com>
@tarcieri
Member Author

Hmm, interesting, on an M2 Max the "parallel" versions seem to offer no speedup and in fact perform worse:

test aes128_decrypt_block  ... bench:         838 ns/iter (+/- 24) = 19551 MB/s
test aes128_decrypt_blocks ... bench:         851 ns/iter (+/- 9) = 19252 MB/s
test aes128_encrypt_block  ... bench:         845 ns/iter (+/- 46) = 19389 MB/s
test aes128_encrypt_blocks ... bench:         853 ns/iter (+/- 26) = 19207 MB/s
test aes128_new            ... bench:          64 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:         984 ns/iter (+/- 27) = 16650 MB/s
test aes192_decrypt_blocks ... bench:       1,004 ns/iter (+/- 46) = 16318 MB/s
test aes192_encrypt_block  ... bench:       1,002 ns/iter (+/- 65) = 16351 MB/s
test aes192_encrypt_blocks ... bench:       1,041 ns/iter (+/- 90) = 15738 MB/s
test aes192_new            ... bench:          62 ns/iter (+/- 3)
test aes256_decrypt_block  ... bench:       1,164 ns/iter (+/- 31) = 14075 MB/s
test aes256_decrypt_blocks ... bench:       1,349 ns/iter (+/- 105) = 12145 MB/s
test aes256_encrypt_block  ... bench:       1,180 ns/iter (+/- 186) = 13884 MB/s
test aes256_encrypt_blocks ... bench:       1,340 ns/iter (+/- 69) = 12226 MB/s
test aes256_new            ... bench:          84 ns/iter (+/- 5)

@newpavlov
Member

newpavlov commented Jun 17, 2023

@tarcieri
Such micro-benchmarks are quite finicky, especially at sub-microsecond scales. The difference is within the measurement error, so I would say the performance is effectively the same. It could be interesting to measure the performance of AES-CTR, or even something like AES-GCM, with and without explicit parallel block processing (i.e. with ParBlocksSize = U1).

Also note that the best number of blocks to process in parallel is processor-dependent. It depends not only on the latency/throughput of the instructions used, but also on the number of available registers, so it may vary even within the same target arch. 8 blocks is the optimal number for mainstream x86 processors, while on certain older x86 CPUs 6 blocks would perform better. So on M1/M2 a different number could produce better results, or maybe we don't even need explicit parallel processing there, since these CPUs are famously wide and have deep reorder buffers.
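As a rough illustration of what "explicit parallel block processing" means here, the sketch below interleaves rounds across N independent blocks using a const generic. `toy_round` is a made-up stand-in and is NOT AES; `encrypt1` and `encrypt_par` are hypothetical names. The point is only the loop structure: within one round, the N blocks have no data dependencies on each other, so a CPU can overlap their instructions.

```rust
/// Stand-in for one cipher round (NOT AES): mixes a block with a round key.
fn toy_round(block: &mut [u8; 16], rk: &[u8; 16]) {
    for (b, k) in block.iter_mut().zip(rk) {
        *b = (*b ^ *k).rotate_left(3).wrapping_add(0x5a);
    }
}

/// One block at a time: each round depends on the result of the previous one.
pub fn encrypt1(block: &mut [u8; 16], round_keys: &[[u8; 16]]) {
    for rk in round_keys {
        toy_round(block, rk);
    }
}

/// N blocks interleaved: each round key is applied to all N blocks before
/// moving on, exposing N independent dependency chains per round.
pub fn encrypt_par<const N: usize>(blocks: &mut [[u8; 16]; N], round_keys: &[[u8; 16]]) {
    for rk in round_keys {
        for block in blocks.iter_mut() {
            toy_round(block, rk);
        }
    }
}
```

Both functions compute the same result per block, so experimenting with different values of N (for example 1, 4, 6, 8) only changes scheduling opportunities and register pressure, which is what would need to be measured per CPU.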

@tarcieri
Member Author

@newpavlov for AES-256 in particular it's outside the noise threshold, where the noise floor would be around 1,271 ns and the non-parallel version is around 1,180 ns. It's also fairly consistent and reproducible.

However, I added the parallelism because there was a noticeable benefit on earlier M1 CPUs, which are much more widespread.

Anyway, just an observation.

@newpavlov
Member

Maybe try to experiment with different numbers of parallel blocks? It would be interesting to see a plot with such data for M1 and M2.

Also, maybe there are Apple recommendations for implementing AES similar to the Intel ones?

@tarcieri tarcieri merged commit 8d03900 into master Jun 17, 2023
25 checks passed
@tarcieri tarcieri deleted the aes/stable-armv8-support branch June 17, 2023 14:26
@tarcieri tarcieri mentioned this pull request Jun 17, 2023
3 participants