aes: support `aes_armv8` on Rust 1.61+ using `asm!` #365

tarcieri · 2023-06-15T19:38:31Z

Adds "polyfills" for the unstable ARMv8 AES intrinsics using the `asm!` macro, which was stabilized in Rust 1.59. Note, however, that we also need the `target_feature` stabilizations for `aes` and `neon`, which occurred in Rust 1.61.

Based on benchmarks, this has no effect on performance, although it was necessary to place each instruction pair (AESE/AESMC and AESD/AESIMC) into a single `asm!` block to ensure that the instructions fuse properly, as they did when using the proper intrinsics.
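
For illustration, here is a minimal sketch of what a fused AESE/AESMC polyfill along these lines could look like. The names `armv8_sketch`, `aese_aesmc`, `round` and `aes_round` are hypothetical, not the ones used in this PR. The runtime-detection wrapper and the non-AArch64 fallback are assumptions added only so the snippet builds on any target.

```rust
#[cfg(target_arch = "aarch64")]
mod armv8_sketch {
    use std::arch::aarch64::{uint8x16_t, vld1q_u8, vst1q_u8};
    use std::arch::asm;

    /// AESE (XOR with the round key, then ShiftRows and SubBytes) immediately
    /// followed by AESMC (MixColumns). Both instructions sit in one `asm!`
    /// block so they stay adjacent in the generated code.
    #[inline]
    #[target_feature(enable = "aes")]
    unsafe fn aese_aesmc(mut data: uint8x16_t, key: uint8x16_t) -> uint8x16_t {
        asm!(
            "AESE {d:v}.16B, {k:v}.16B",
            "AESMC {d:v}.16B, {d:v}.16B",
            d = inout(vreg) data,
            k = in(vreg) key,
            options(pure, nomem, nostack, preserves_flags)
        );
        data
    }

    /// Byte-array wrapper. The caller must have checked that `aes` is available.
    #[inline]
    #[target_feature(enable = "aes")]
    pub unsafe fn round(block: [u8; 16], key: [u8; 16]) -> [u8; 16] {
        let state = aese_aesmc(vld1q_u8(block.as_ptr()), vld1q_u8(key.as_ptr()));
        let mut out = [0u8; 16];
        vst1q_u8(out.as_mut_ptr(), state);
        out
    }
}

/// Runs one fused round, or returns `None` if the CPU lacks the AES extension.
#[cfg(target_arch = "aarch64")]
pub fn aes_round(block: [u8; 16], key: [u8; 16]) -> Option<[u8; 16]> {
    if std::arch::is_aarch64_feature_detected!("aes") {
        // SAFETY: the `aes` feature was detected at runtime just above.
        Some(unsafe { armv8_sketch::round(block, key) })
    } else {
        None
    }
}

/// Fallback for other architectures (an assumption of this sketch).
#[cfg(not(target_arch = "aarch64"))]
pub fn aes_round(_block: [u8; 16], _key: [u8; 16]) -> Option<[u8; 16]> {
    None
}
```

With an all-zero block and an all-zero round key, every output byte should be `0x63` on hardware with the AES extension: the S-box maps `0x00` to `0x63`, and ShiftRows and MixColumns leave a state of sixteen equal bytes unchanged.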

In the next breaking release, we should be able to get rid of the `aes_armv8` configuration parameter entirely, bumping MSRV to 1.61, and then ARMv8 support should Just Work(TM) where available.

Performance appears to be unchanged.

Benchmarks (M1 Max)

Before

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 2.24s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-387e2e27bf9fddbd)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-8400f88c92c19848)

running 15 tests
test aes128_decrypt_block  ... bench:       1,059 ns/iter (+/- 8) = 15471 MB/s
test aes128_decrypt_blocks ... bench:       1,189 ns/iter (+/- 17) = 13779 MB/s
test aes128_encrypt_block  ... bench:       1,056 ns/iter (+/- 31) = 15515 MB/s
test aes128_encrypt_blocks ... bench:       1,192 ns/iter (+/- 86) = 13744 MB/s
test aes128_new            ... bench:         102 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,510 ns/iter (+/- 10) = 10850 MB/s
test aes192_decrypt_blocks ... bench:       1,272 ns/iter (+/- 84) = 12880 MB/s
test aes192_encrypt_block  ... bench:       1,448 ns/iter (+/- 22) = 11314 MB/s
test aes192_encrypt_blocks ... bench:       1,276 ns/iter (+/- 24) = 12840 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 1)
test aes256_decrypt_block  ... bench:       1,728 ns/iter (+/- 31) = 9481 MB/s
test aes256_decrypt_blocks ... bench:       1,545 ns/iter (+/- 10) = 10604 MB/s
test aes256_encrypt_block  ... bench:       1,727 ns/iter (+/- 14) = 9486 MB/s
test aes256_encrypt_blocks ... bench:       1,547 ns/iter (+/- 23) = 10590 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 24.93s

After

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 1.07s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-eb55ebb5f7ca4ca0)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-6305260e7c7483e7)

running 15 tests
test aes128_decrypt_block  ... bench:       1,060 ns/iter (+/- 7) = 15456 MB/s
test aes128_decrypt_blocks ... bench:       1,025 ns/iter (+/- 13) = 15984 MB/s
test aes128_encrypt_block  ... bench:       1,059 ns/iter (+/- 7) = 15471 MB/s
test aes128_encrypt_blocks ... bench:       1,024 ns/iter (+/- 8) = 16000 MB/s
test aes128_new            ... bench:         101 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:       1,446 ns/iter (+/- 16) = 11330 MB/s
test aes192_decrypt_blocks ... bench:       1,271 ns/iter (+/- 25) = 12890 MB/s
test aes192_encrypt_block  ... bench:       1,447 ns/iter (+/- 18) = 11322 MB/s
test aes192_encrypt_blocks ... bench:       1,275 ns/iter (+/- 8) = 12850 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 5)
test aes256_decrypt_block  ... bench:       1,833 ns/iter (+/- 17) = 8938 MB/s
test aes256_decrypt_blocks ... bench:       1,569 ns/iter (+/- 9) = 10442 MB/s
test aes256_encrypt_block  ... bench:       1,727 ns/iter (+/- 19) = 9486 MB/s
test aes256_encrypt_blocks ... bench:       1,570 ns/iter (+/- 18) = 10435 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 21.59s
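
For reading the MB/s column: the figures above are consistent with each iteration processing 16 KiB, with the result rounded down. That buffer size is inferred from the numbers, not taken from the benchmark source, so treat the constant below as an assumption. A small sketch of the conversion (`throughput_mb_per_s` is a hypothetical helper name):

```rust
/// Bytes processed per benchmark iteration.
/// Assumption: 16 KiB, inferred from the reported ns/iter and MB/s pairs.
const BYTES_PER_ITER: u64 = 16 * 1024;

/// Converts a `ns/iter` timing into MB/s (1 MB = 10^6 bytes), rounding down.
fn throughput_mb_per_s(ns_per_iter: u64) -> u64 {
    // bytes / ns * (10^9 ns/s) / (10^6 bytes/MB) = bytes * 1000 / ns
    BYTES_PER_ITER * 1000 / ns_per_iter
}
```

For example, 1,059 ns/iter gives 15471 MB/s and 1,024 ns/iter gives 16000 MB/s, matching the `aes128` rows.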

cc @codahale

tarcieri · 2023-06-15T19:46:24Z

Oh hmm, the target features were stabilized sometime after 1.59. Guess I'll have to figure out when.

Edit: looks like Rust 1.61.

tarcieri · 2023-06-15T20:12:56Z

Weird, it wouldn't compile on `aarch64-unknown-linux-gnu` without 10debe5, which causes a performance regression.

Let me play around with it a bit more. It's working fine on `aarch64-apple-darwin`, and I would think `#[inline(always)]` would work in place of `target_feature`.

newpavlov · 2023-06-15T20:27:30Z

> Weird, it wouldn't compile on `aarch64-unknown-linux-gnu` without 10debe5, which causes a performance regression.

Huh, it certainly looks weird. I don't think I've seen such behavior on x86. I would think that `asm!` blocks should be completely opaque to the compiler (apart from register allocation stuff). Maybe ask about it on IRLO or open a new issue in the Rust repository?

tarcieri · 2023-06-15T21:43:32Z

Yeah, seems like some sort of bug in rustc. You can see the build failure here:

https://github.com/RustCrypto/block-ciphers/actions/runs/5283017067/jobs/9558698657

Curiously it only seems to care about AESE and AESIMC (and even then, only sometimes). It doesn't seem to care about AESD and AESMC.

tarcieri · 2023-06-15T22:46:59Z

I'm pretty perplexed by the performance difference of 10debe5 too... it seems like unrelated code makes the performance suffer greatly:

$ RUSTFLAGS="--cfg aes_armv8" cargo +nightly bench
   Compiling aes v0.8.2 (/Users/tony/src/RustCrypto/block-ciphers/aes)
    Finished bench [optimized] target(s) in 1.40s
     Running unittests src/lib.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/aes-eb55ebb5f7ca4ca0)

running 4 tests
test armv8::test_expand::aes128_key_expansion ... ignored
test armv8::test_expand::aes128_key_expansion_inv ... ignored
test armv8::test_expand::aes192_key_expansion ... ignored
test armv8::test_expand::aes256_key_expansion ... ignored

test result: ok. 0 passed; 0 failed; 4 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/Users/tony/src/RustCrypto/block-ciphers/target/release/deps/mod-6305260e7c7483e7)

running 15 tests
test aes128_decrypt_block  ... bench:       1,060 ns/iter (+/- 9) = 15456 MB/s
test aes128_decrypt_blocks ... bench:       1,027 ns/iter (+/- 14) = 15953 MB/s
test aes128_encrypt_block  ... bench:       2,698 ns/iter (+/- 43) = 6072 MB/s
test aes128_encrypt_blocks ... bench:       3,606 ns/iter (+/- 170) = 4543 MB/s
test aes128_new            ... bench:         104 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:       1,447 ns/iter (+/- 14) = 11322 MB/s
test aes192_decrypt_blocks ... bench:       1,273 ns/iter (+/- 11) = 12870 MB/s
test aes192_encrypt_block  ... bench:       3,032 ns/iter (+/- 93) = 5403 MB/s
test aes192_encrypt_blocks ... bench:       3,607 ns/iter (+/- 172) = 4542 MB/s
test aes192_new            ... bench:         103 ns/iter (+/- 3)
test aes256_decrypt_block  ... bench:       1,755 ns/iter (+/- 85) = 9335 MB/s
test aes256_decrypt_blocks ... bench:       1,571 ns/iter (+/- 28) = 10429 MB/s
test aes256_encrypt_block  ... bench:       3,417 ns/iter (+/- 22) = 4794 MB/s
test aes256_encrypt_blocks ... bench:       4,148 ns/iter (+/- 268) = 3949 MB/s
test aes256_new            ... bench:         124 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 22.36s

newpavlov · 2023-06-15T22:58:18Z

My guess is that it fails to inline the "intrinsics", despite having `#[target_feature(enable = "aes")]` on their users. It could be worth exploring the code generation in godbolt to find what exactly causes the degradation.

tarcieri · 2023-06-16T00:14:56Z

@newpavlov what's really weird is that the performance of `vaeseq_u8_and_vaesmcq_u8` is massively degraded, but it doesn't call either `vaeseq_u8` or `vaesimcq_u8`; it has its own separate block of assembly.

Replacing `#[inline(always)]` with `#[target_feature(enable = "aes")]` is causing a performance degradation on seemingly unrelated code. Perhaps you're right that it's breaking inlining somehow, but the functions are only related by their contents (i.e. the same ASM instructions), not a direct call path.

tarcieri · 2023-06-16T01:32:06Z

I'm also having trouble extracting a minimal repro. Just encrypt1/encrypt8/decrypt1/decrypt8 and the intrinsics seem to be fine, so I guess it's something with the key expansion.

tarcieri · 2023-06-16T16:21:22Z

I opened a rustc issue here: rust-lang/rust#112709

This seems to fix the build failures we were experiencing here: rust-lang/rust#112709

tarcieri · 2023-06-16T18:07:15Z

Okay, it seems some callers in the key expansion code weren't properly annotated with `target_feature(enable = "aes")`, and those were the culprit: c7fe62e

The OS-specific discrepancies were due to `target_feature=+aes` being enabled on certain targets but not others, with the error occurring where it wasn't enabled.
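
A quick way to see which situation a given target is in is to compare the compile-time and runtime views of the feature. The sketch below is only illustrative: the function names are hypothetical, and returning `None` outside AArch64 is an assumption so that it builds everywhere.

```rust
/// `true` when the compiler treats AES instructions as always available for
/// this build (target defaults or `-C target-feature=+aes`).
pub fn aes_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "aes")
}

/// Runtime view on AArch64, using the standard detection macro.
#[cfg(target_arch = "aarch64")]
pub fn aes_detected_at_runtime() -> Option<bool> {
    Some(std::arch::is_aarch64_feature_detected!("aes"))
}

/// Other architectures: not applicable in this sketch.
#[cfg(not(target_arch = "aarch64"))]
pub fn aes_detected_at_runtime() -> Option<bool> {
    None
}

/// One-line summary suitable for printing from CI on each target.
pub fn report() -> String {
    format!(
        "arch={} static_aes={} runtime_aes={:?}",
        std::env::consts::ARCH,
        aes_enabled_at_compile_time(),
        aes_detected_at_runtime()
    )
}
```

Where the compile-time answer is `false`, the thread suggests that any function that ends up containing the AES instructions, including through inlining, needs `#[target_feature(enable = "aes")]` itself.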

With that new commit we're green on all targets, and I've confirmed performance is not impacted, so this is ready for review.

tarcieri · 2023-06-16T19:45:13Z

@newpavlov hmm, perhaps we should use `#[inline]` + `#[target_feature]` instead of `#[inline(always)]` to avoid these hard-to-debug errors: rust-lang/stdarch#306

Edit: went ahead and did that in 26a070d

tarcieri · 2023-06-17T12:51:02Z

Hmm, interesting, on an M2 Max the "parallel" versions seem to offer no speedup and in fact perform worse:

test aes128_decrypt_block  ... bench:         838 ns/iter (+/- 24) = 19551 MB/s
test aes128_decrypt_blocks ... bench:         851 ns/iter (+/- 9) = 19252 MB/s
test aes128_encrypt_block  ... bench:         845 ns/iter (+/- 46) = 19389 MB/s
test aes128_encrypt_blocks ... bench:         853 ns/iter (+/- 26) = 19207 MB/s
test aes128_new            ... bench:          64 ns/iter (+/- 4)
test aes192_decrypt_block  ... bench:         984 ns/iter (+/- 27) = 16650 MB/s
test aes192_decrypt_blocks ... bench:       1,004 ns/iter (+/- 46) = 16318 MB/s
test aes192_encrypt_block  ... bench:       1,002 ns/iter (+/- 65) = 16351 MB/s
test aes192_encrypt_blocks ... bench:       1,041 ns/iter (+/- 90) = 15738 MB/s
test aes192_new            ... bench:          62 ns/iter (+/- 3)
test aes256_decrypt_block  ... bench:       1,164 ns/iter (+/- 31) = 14075 MB/s
test aes256_decrypt_blocks ... bench:       1,349 ns/iter (+/- 105) = 12145 MB/s
test aes256_encrypt_block  ... bench:       1,180 ns/iter (+/- 186) = 13884 MB/s
test aes256_encrypt_blocks ... bench:       1,340 ns/iter (+/- 69) = 12226 MB/s
test aes256_new            ... bench:          84 ns/iter (+/- 5)

newpavlov · 2023-06-17T13:31:27Z

@tarcieri
Such micro-benchmarks are quite finicky, especially on sub-microsecond scales. The difference is within the measurement error, so I would say the performance is effectively the same. It could be interesting to measure performance of AES-CTR or even something like AES-GCM with and without explicit parallel block processing (i.e. with ParBlocksSize = U1).

Also note that the best number of blocks to process in parallel is processor-dependent. It depends not only on the latency/throughput of the instructions used, but also on the number of available registers. So even within the same target arch it may vary. 8 blocks is the optimal number for mainstream x86 processors, while on certain older x86 CPUs 6 blocks would perform better. So on M1/M2 a different number could produce better results, or maybe we don't even need explicit parallel processing there, since these CPUs are famously wide and have deep reorder buffers.
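
To make "explicit parallel block processing" concrete, here is a small sketch with the block count as a const generic, so different values can be compared. `toy_round` is a stand-in and is NOT AES. All names are hypothetical and unrelated to the crate's actual backend traits.

```rust
/// Stand-in for one cipher round (not AES): XOR with the round key, rotate bytes.
fn toy_round(block: &mut [u8; 16], round_key: &[u8; 16]) {
    for (b, k) in block.iter_mut().zip(round_key) {
        *b ^= *k;
    }
    block.rotate_left(1);
}

/// One block at a time: each round depends on the result of the previous one.
fn encrypt_sequential(blocks: &mut [[u8; 16]], round_keys: &[[u8; 16]]) {
    for block in blocks.iter_mut() {
        for rk in round_keys {
            toy_round(block, rk);
        }
    }
}

/// `N` blocks per step with the round loop outermost, so the `N` round
/// computations inside it are independent of each other. `N` must be non-zero.
fn encrypt_interleaved<const N: usize>(blocks: &mut [[u8; 16]], round_keys: &[[u8; 16]]) {
    let mut chunks = blocks.chunks_exact_mut(N);
    for chunk in &mut chunks {
        for rk in round_keys {
            for block in chunk.iter_mut() {
                toy_round(block, rk);
            }
        }
    }
    // Fewer than `N` blocks left over: fall back to the one-at-a-time path.
    encrypt_sequential(chunks.into_remainder(), round_keys);
}
```

Both functions produce the same output. Only the order of independent work changes, which is why the best `N` depends on how much of that independence a given core can exploit on its own.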

tarcieri · 2023-06-17T13:39:12Z

@newpavlov for AES-256 in particular it's outside the noise threshold, where the noise floor would be around 1,271 ns and the non-parallel version is around 1,180 ns. It's also fairly consistent and reproducible.
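
As a rough check of the AES-256 rows from the M2 Max run above, the sketch below treats the reported `+/-` as a plain range around the reported time. That reading is an assumption, not a statistical test, and the type and constant names are hypothetical.

```rust
#[derive(Clone, Copy, Debug)]
struct Sample {
    ns_per_iter: u32,
    plus_minus: u32,
}

impl Sample {
    fn low(self) -> u32 {
        self.ns_per_iter - self.plus_minus
    }
    fn high(self) -> u32 {
        self.ns_per_iter + self.plus_minus
    }
}

/// True when the two `[low, high]` ranges share at least one value.
fn ranges_overlap(a: Sample, b: Sample) -> bool {
    a.low() <= b.high() && b.low() <= a.high()
}

// Values copied from the M2 Max benchmark output in this thread.
const AES256_DECRYPT_BLOCK: Sample = Sample { ns_per_iter: 1164, plus_minus: 31 };
const AES256_DECRYPT_BLOCKS: Sample = Sample { ns_per_iter: 1349, plus_minus: 105 };
const AES256_ENCRYPT_BLOCK: Sample = Sample { ns_per_iter: 1180, plus_minus: 186 };
const AES256_ENCRYPT_BLOCKS: Sample = Sample { ns_per_iter: 1340, plus_minus: 69 };
```

Under that reading the decrypt ranges are disjoint ([1133, 1195] vs [1244, 1454]), while the encrypt ranges overlap because of the 186 ns spread on the single-block run. The lower edge of `aes256_encrypt_blocks` is the 1,271 ns figure mentioned above.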

However, I added the parallelism because there was a noticeable benefit on earlier M1 CPUs, which are much more widespread.

Anyway, just an observation.

newpavlov · 2023-06-17T14:00:39Z

Maybe try to experiment with different numbers of parallel blocks? It would be interesting to see a plot with such data for M1 and M2.

Also, maybe there are Apple recommendations for implementing AES similar to the Intel ones?

tarcieri requested a review from newpavlov June 15, 2023 19:38

tarcieri force-pushed the aes/stable-armv8-support branch from 4655fe4 to 64c7f15 Compare June 15, 2023 19:40

tarcieri force-pushed the aes/stable-armv8-support branch from 64c7f15 to 7818f35 Compare June 15, 2023 19:53

tarcieri changed the title ~~arm: support aes_armv8 on Rust 1.59+ using asm!~~ arm: support aes_armv8 on Rust 1.61+ using asm! Jun 15, 2023

tarcieri marked this pull request as draft June 15, 2023 20:13

tarcieri mentioned this pull request Jun 16, 2023

inline ASM requires target features on some targets rust-lang/rust#112709

Closed

tarcieri force-pushed the aes/stable-armv8-support branch from 10debe5 to cc70182 Compare June 16, 2023 17:49

Annotate outer AES functions with target_feature

c7fe62e

This seems to fix the build failures we were experiencing here: rust-lang/rust#112709

tarcieri force-pushed the aes/stable-armv8-support branch from cc70182 to c7fe62e Compare June 16, 2023 17:53

tarcieri marked this pull request as ready for review June 16, 2023 18:04

target_feature cleanups

26a070d

tarcieri changed the title ~~arm: support aes_armv8 on Rust 1.61+ using asm!~~ aes: support aes_armv8 on Rust 1.61+ using asm! Jun 16, 2023

taiki-e reviewed Jun 17, 2023

aes/src/armv8/intrinsics.rs

aes/src/armv8/expand.rs

Update aes/src/armv8/intrinsics.rs

1e20f5f

Co-authored-by: Taiki Endo <te316e89@gmail.com>

newpavlov approved these changes Jun 17, 2023

tarcieri merged commit 8d03900 into master Jun 17, 2023
25 checks passed

tarcieri deleted the aes/stable-armv8-support branch June 17, 2023 14:26

tarcieri mentioned this pull request Jun 17, 2023

aes v0.8.3 #368

Merged

tarcieri mentioned this pull request Aug 24, 2023

aes: replace inline ASM with ARMv8 intrinsics #380

Merged

andyleiserson mentioned this pull request Oct 19, 2023

Update aes crate and enable intrinsics on aarch64 private-attribution/ipa#810

Merged
