8359419: AArch64: Relax min vector length to 32-bit for short vectors #26057

Open
wants to merge 1 commit into
base: master
@XiaohongGong XiaohongGong commented Jul 1, 2025

Background

On AArch64, the minimum vector length supported is 64-bit for basic types, except for byte and boolean (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between short and wider types (e.g. long/double) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions.

For example, type conversions between ShortVector.SPECIES_128 and LongVector.SPECIES_128 are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size.

To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors.
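The size arithmetic behind this example can be sketched as follows. The helper names `lane_count` and `vector_bits` are hypothetical and exist only for illustration; the sketch assumes, as the description states, that the short side of the conversion has the same 2 lanes as the 128-bit long vector.

```cpp
// Hypothetical helpers for illustration only (not from the JDK sources).

// Number of lanes in a vector of `total_bits` bits with `elem_bits`-bit elements.
constexpr int lane_count(int total_bits, int elem_bits) {
    return total_bits / elem_bits;
}

// Size in bits of a vector with `lanes` elements of `elem_bits` bits each.
constexpr int vector_bits(int lanes, int elem_bits) {
    return lanes * elem_bits;
}

// A 128-bit long vector has 128 / 64 = 2 lanes.
constexpr int kLongLanes = lane_count(128, 64);

// A short vector with the same 2 lanes is only 2 * 16 = 32 bits wide,
// which is below the previous 64-bit minimum for short vectors.
constexpr int kShortSideBits = vector_bits(kLongLanes, 16);

static_assert(kLongLanes == 2, "128-bit long vector has 2 lanes");
static_assert(kShortSideBits == 32, "2 short lanes occupy 32 bits");
```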

Impact Analysis

1. Vector types

Only vectors with short element types are affected, since this change adds 32-bit support for short vectors only.

2. Vector API

No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length.

3. Auto-vectorization

Enables vectorization of cases containing only 2 short lanes, with significant performance improvements. Since 32-bit vectors have long been supported for the byte type, extending this support to short should not introduce additional risk.

4. Codegen of vector nodes

NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored.

Details:

  • Lanewise vector operations are unaffected as explained above.
  • NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE).
  • Cross-lane operations like reduction may be affected, potentially causing incorrect results for min/max/mul/and reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in match_rule_supported_vector() would be beneficial.
  • Missing codegen support for type conversions with 32-bit input or output vector size should be added.
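The difference between lanewise and cross-lane operations can be illustrated with a small scalar model. This is only a sketch: `Reg4S`, `add_4s` and `max_reduce` are hypothetical names, and the model stands in for a 64-bit NEON register holding a 2-short vector in its low 32 bits, with unspecified data in the upper two lanes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative model of a 64-bit register viewed as four 16-bit lanes.
// For a 2-short vector only lanes 0 and 1 are meaningful.
using Reg4S = std::array<int16_t, 4>;

// Lanewise add over the full 64-bit width, as a 64-bit SIMD add would do.
// Each result lane depends only on the same lane of the inputs.
Reg4S add_4s(const Reg4S& a, const Reg4S& b) {
    Reg4S r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = static_cast<int16_t>(a[i] + b[i]);
    }
    return r;
}

// Cross-lane max reduction over the first `lanes` lanes.
// Reducing over more lanes than are valid mixes in unspecified data.
int16_t max_reduce(const Reg4S& v, std::size_t lanes) {
    int16_t m = v[0];
    for (std::size_t i = 1; i < lanes; ++i) {
        if (v[i] > m) {
            m = v[i];
        }
    }
    return m;
}
```

With inputs `{1, 2, 1000, -7}` and `{10, 20, 3000, 9}`, lanes 0 and 1 of the sum are 11 and 22 regardless of the upper lanes, so the lanewise result is safe to use. A max reduction over the 2 valid lanes gives 22, while a reduction over all 4 lanes would give 4000, which is why reductions on vectors shorter than 64 bits need to stay unsupported.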

Main changes:

  • Support 2 shorts vector types. The supported min vector element count for each basic type is:
    • T_BOOLEAN: 2
    • T_BYTE/T_CHAR: 4
    • T_SHORT: 2 (newly supported)
    • T_INT/T_FLOAT/T_LONG/T_DOUBLE: 2
  • Add codegen support for Vector[U]Cast with 32-bit input or output vector size. VectorReinterpret has already considered the 32-bit vector size cases.
  • Explicitly mark reductions with a vector size of less than 8 bytes as unsupported.
  • Add additional IR tests for Vector API type conversions.
  • Add JMH benchmark for auto-vectorization with two 16-bit lanes.
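The minimum element counts listed above can be sketched as a small lookup. This is a simplified stand-in, not the actual HotSpot code: the `BasicType` enum, `elem_bytes`, `min_vector_elems` and `reduction_supported` are hypothetical, and the default rule (64-bit minimum, but never fewer than 2 elements) is an assumption inferred from the list.

```cpp
// Simplified stand-in for HotSpot's BasicType (hypothetical enum).
enum class BasicType {
    T_BOOLEAN, T_BYTE, T_CHAR, T_SHORT, T_INT, T_FLOAT, T_LONG, T_DOUBLE
};

// Element size in bytes for each basic type.
int elem_bytes(BasicType bt) {
    switch (bt) {
        case BasicType::T_BOOLEAN:
        case BasicType::T_BYTE:   return 1;
        case BasicType::T_CHAR:
        case BasicType::T_SHORT:  return 2;
        case BasicType::T_INT:
        case BasicType::T_FLOAT:  return 4;
        case BasicType::T_LONG:
        case BasicType::T_DOUBLE: return 8;
    }
    return 0;  // not reached for valid enum values
}

// Minimum number of vector elements per type, following the list above.
int min_vector_elems(BasicType bt) {
    switch (bt) {
        case BasicType::T_BOOLEAN: return 2;  // 16-bit vector mask
        case BasicType::T_BYTE:    return 4;  // 32-bit byte vector
        case BasicType::T_SHORT:   return 2;  // 32-bit short vector (new)
        default: {
            // Assumed default: at least 64 bits and at least 2 elements.
            int n = 8 / elem_bytes(bt);
            return n < 2 ? 2 : n;
        }
    }
}

// Reductions stay unsupported below 8 bytes of vector length.
bool reduction_supported(int length_in_bytes) {
    return length_in_bytes >= 8;
}
```

Under these assumptions, a 2-element short vector is 4 bytes long, so it is accepted as a vector size but rejected for reductions.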

Test

Tested hotspot/jdk/langtools - all tests passed.

Performance

The following table shows the performance improvement of the relevant Vector API JMH benchmarks on an NVIDIA Grace (128-bit SVE2) machine:

Benchmark                                             SIZE   Mode  Unit   Before     After    Gain
VectorFPtoIntCastOperations.microDouble128ToShort128  512   thrpt ops/ms  731.529  26278.599  35.92
VectorFPtoIntCastOperations.microDouble128ToShort128  1024  thrpt ops/ms  366.461  10595.767  28.91
VectorFPtoIntCastOperations.microFloat64ToShort64     512   thrpt ops/ms  315.791  14327.682  45.37
VectorFPtoIntCastOperations.microFloat64ToShort64     1024  thrpt ops/ms  158.485   7261.847  45.82
VectorZeroExtend.short2Long                           128   thrpt ops/ms 1447.243 898666.972 620.95

And here is the performance improvement of the added JMH on Grace:

Benchmark                          LEN   Mode  Unit   Before    After   Gain
VectorTwoShorts.addVec2S           64    avgt  ns/op   20.948   12.683  1.65
VectorTwoShorts.addVec2S           128   avgt  ns/op   40.073   22.703  1.76
VectorTwoShorts.addVec2S           512   avgt  ns/op  157.447   83.691  1.88
VectorTwoShorts.addVec2S           1024  avgt  ns/op  313.022  165.085  1.89
VectorTwoShorts.mulVec2S           64    avgt  ns/op   20.981   12.647  1.65
VectorTwoShorts.mulVec2S           128   avgt  ns/op   40.279   22.637  1.77
VectorTwoShorts.mulVec2S           512   avgt  ns/op  158.642   83.371  1.90
VectorTwoShorts.mulVec2S           1024  avgt  ns/op  314.788  165.205  1.90
VectorTwoShorts.reverseBytesVec2S  64    avgt  ns/op   17.739    9.106  1.94
VectorTwoShorts.reverseBytesVec2S  128   avgt  ns/op   32.591   15.632  2.08
VectorTwoShorts.reverseBytesVec2S  512   avgt  ns/op  126.154   55.284  2.28
VectorTwoShorts.reverseBytesVec2S  1024  avgt  ns/op  254.592  107.457  2.36

We observe a similar uplift on an AArch64 N1 (NEON) machine.
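The Gain columns in the two tables are consistent with a simple ratio whose direction depends on the benchmark mode. The description does not state the formula, so the sketch below is an inference from the numbers, and the function names are hypothetical.

```cpp
// Throughput mode (thrpt, ops/ms): higher is better, so Gain = After / Before.
constexpr double gain_throughput(double before, double after) {
    return after / before;
}

// Average-time mode (avgt, ns/op): lower is better, so Gain = Before / After.
constexpr double gain_avg_time(double before, double after) {
    return before / after;
}
```

For example, `gain_throughput(731.529, 26278.599)` reproduces the 35.92 reported for microDouble128ToShort128 at size 512, and `gain_avg_time(20.948, 12.683)` reproduces the 1.65 reported for addVec2S at length 64, both to two decimal places.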


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8359419

Issue

  • JDK-8359419: AArch64: Support min vector size of 32-bit (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057
$ git checkout pull/26057

Update a local copy of the PR:
$ git checkout pull/26057
$ git pull https://git.openjdk.org/jdk.git pull/26057/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26057

View PR using the GUI difftool:
$ git pr show -t 26057

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26057.diff

Using Webrev

Link to Webrev Comment


bridgekeeper bot commented Jul 1, 2025

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Jul 1, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 1, 2025

openjdk bot commented Jul 1, 2025

@XiaohongGong The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jul 1, 2025

mlbridge bot commented Jul 1, 2025

Webrevs

int size;
switch(bt) {
case T_BOOLEAN:
// It needs to load/store a vector mask with only 2 elements
Contributor

Suggested change
- // It needs to load/store a vector mask with only 2 elements
+ // Load/store a vector mask with only 2 elements

Same with the other cases.

Author

Thanks so much for your comment. I will fix them soon.

size = 2;
break;
default:
// Limit the min vector length to 64-bit normally.
Contributor

Suggested change
- // Limit the min vector length to 64-bit normally.
+ // Limit the min vector length to 64-bit.

case Op_MinReductionV:
case Op_MaxReductionV:
// Reductions with less than 8 bytes vector length are
// not supported for now.
Contributor

Suggested change
- // not supported for now.
+ // not supported.

Labels
core-libs core-libs-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review