-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
aes: support aes_armv8
on Rust 1.61+ using asm!
#365
Conversation
4655fe4
to
64c7f15
Compare
Oh hmm, the target features were stabilized sometime after 1.59. Guess I'll have to figure out when. Edit: looks like Rust 1.61. |
Adds "polyfills" for the unstable ARMv8 AES intrinsics using the `asm!` macro which was stabilized in Rust 1.59. However note we also need `target_feature` stabilizations for `aes` and `neon` which occurred in Rust 1.61. Based on benchmarks this has no effect on performance, although it was necessary to place AESE/AESMC and AESD/AESIMC into a single `asm!` block in order to ensure that instructions fuse properly, as they did when using the proper intrinsics.
64c7f15
to
7818f35
Compare
aes_armv8
on Rust 1.59+ using asm!
aes_armv8
on Rust 1.61+ using asm!
Weird, it wouldn't compile on Let me play around with it a bit more. It's working fine on |
Huh, it's certainly looks weird. I don't think I've seen such behavior on x86. I would think that asm blocks should be completely opaque for compiler (apart from register allocation stuff). Maybe ask about it on IRLO or in a new Rust repository issue? |
Yeah, seems like some sort of bug in rustc. You can see the build failure here: https://github.com/RustCrypto/block-ciphers/actions/runs/5283017067/jobs/9558698657 Curiously it only seems to care about |
I'm pretty perplexed by the performance difference of 10debe5 too... it seems like unrelated code makes the performance greatly suffer
|
My guess is that it fails to inline the "intrinsics", despite having |
@newpavlov what's really weird is that the performance of Replacing |
I'm also having trouble extracting a minimal repro. Just |
I opened a rustc issue here: rust-lang/rust#112709 |
10debe5
to
cc70182
Compare
This seems to fix the build failures we were experiencing here: rust-lang/rust#112709
cc70182
to
c7fe62e
Compare
Okay, it seems there were some callers in the key expansion code which weren't properly annotated with The OS-specific discrepancies were due to With that new commit we're green on all targets, and I've confirmed performance is not impacted, so this is ready for review. |
@newpavlov hmm, perhaps we should use Edit: went ahead and did that in 26a070d |
aes_armv8
on Rust 1.61+ using asm!
aes_armv8
on Rust 1.61+ using asm!
Co-authored-by: Taiki Endo <te316e89@gmail.com>
Hmm, interesting, on an M2 Max the "parallel" versions seem to offer no speedup and in fact perform worse:
|
@tarcieri Also note that the best number of blocks processed in parallel is processor-dependent. It depends not only on latency/throughput of used instructions, but also on number of available registers. So even within the same target arch it may vary. 8 blocks is the optimal number for mainstream x86 processors, while on certain older x86 CPUs 6 blocks would perform better. So on M1/M2 a different number could produce better results, or maybe we even don't need explicit parallel processing for it, since these CPUs are famously wide and have deep reordering buffers. |
@newpavlov for AES-256 in particular it's outside the noise threshold, where the noise floor would be around 1,271 ns and the non-parallel version is around 1,180 ns. It's also fairly consistent and reproducible. I added the parallelism because there was a noticeable benefit on earlier M1 CPUs however, which are much more widespread. Anyway, just an observation. |
Maybe try to experiment with different numbers of parallel blocks? It would be interesting to see a plot with such data for M1 and M2. Also, maybe there are Apple recommendations for implementing AES similar to the Intel ones? |
Adds "polyfills" for the unstable ARMv8 AES intrinsics using the
asm!
macro which was stabilized in Rust 1.59. However note we also needtarget_feature
stabilizations foraes
andneon
which occurred in Rust 1.61.Based on benchmarks this has no effect on performance, although it was necessary to place AESE/AESMC and AESD/AESIMC into a single
asm!
block in order to ensure that instructions fuse properly, as they did when using the proper intrinsics.In the next breaking release, we should be able to get rid of the
aes_armv8
configuration parameter entirely, bumping MSRV to 1.59 and then ARMv8 support should Just Work(TM) where available.Performance appears to be unchanged.
Benchmarks (M1 Max)
Before
After
cc @codahale