8359419: AArch64: Relax min vector length to 32-bit for short vectors #26057

XiaohongGong · 2025-07-01T05:59:15Z

Background

On AArch64, the minimum supported vector length is 64-bit for basic types, except for byte and boolean (32-bit and 16-bit respectively, to match special Vector API features). This limitation prevents intrinsification of Vector API type conversions between short and wider types (e.g. long/double) when the entire vector length is within 128 bits, resulting in degraded performance for such conversions.

For example, type conversions between ShortVector.SPECIES_128 and LongVector.SPECIES_128 are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size.

To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors.
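
The lane arithmetic behind this can be checked with a small sketch. The class and method names here are illustrative only and are not part of HotSpot or the Vector API; the sketch assumes a conversion keeps the lane count of the wider species.

```java
class LaneMath {
    // Number of lanes a vector of the given bit width holds for an element width.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Bits occupied by laneCount elements of the given width.
    static int occupiedBits(int laneCount, int elementBits) {
        return laneCount * elementBits;
    }

    public static void main(String[] args) {
        // A 128-bit long vector has 2 lanes.
        int longLanes = lanes(128, Long.SIZE);
        // The matching short vector has the same 2 lanes, i.e. 32 bits.
        int shortBits = occupiedBits(longLanes, Short.SIZE);
        System.out.println(longLanes + " lanes -> " + shortBits + "-bit short vector");
    }
}
```

With a 64-bit minimum, the 32-bit intermediate vector cannot be represented, which is why the conversion was not intrinsified.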

Impact Analysis

1. Vector types

Only vectors with short element types are affected, since this change adds support for 32-bit short vectors only.

2. Vector API

No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length.

3. Auto-vectorization

Enables vectorization of cases containing only 2 short lanes, with significant performance improvements. Since 32-bit vectors have been supported for the byte type for a long time, extending this to short does not introduce additional risks.

4. Codegen of vector nodes

NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the upper 32 bits of the result can be ignored.

Details:

- Lanewise vector operations are unaffected as explained above.
- NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE).
- Cross-lane operations like reduction may be affected, potentially causing incorrect results for min/max/mul/and reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in match_rule_supported_vector() would be beneficial.
- Missing codegen support for type conversions with 32-bit input or output vector size should be added.
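
The difference between lanewise and cross-lane operations described above can be illustrated with a plain Java simulation. This is only a model, not HotSpot code: the 64-bit register is represented as a hypothetical array of 4 short lanes, of which only lanes 0 and 1 belong to the 32-bit vector, and the stale values in lanes 2 and 3 are made up for the example.

```java
class HalfRegisterDemo {
    // Lanewise add over the full 64-bit register (4 short lanes).
    // Each result lane depends only on the same input lane.
    static short[] addLanes(short[] a, short[] b) {
        short[] r = new short[4];
        for (int i = 0; i < 4; i++) {
            r[i] = (short) (a[i] + b[i]);
        }
        return r;
    }

    // A 64-bit-wide min reduction visits all 4 lanes, including stale ones.
    static short minAllLanes(short[] v) {
        short m = v[0];
        for (int i = 1; i < 4; i++) {
            m = (short) Math.min(m, v[i]);
        }
        return m;
    }

    // The intended reduction for a 2-lane vector visits lanes 0 and 1 only.
    static short minValidLanes(short[] v) {
        return (short) Math.min(v[0], v[1]);
    }

    public static void main(String[] args) {
        short[] a = {10, 20, -500, 7}; // lanes 2 and 3 hold stale data
        short[] b = {1, 2, 3, 4};
        short[] sum = addLanes(a, b);
        System.out.println("lanewise add, valid lanes: " + sum[0] + ", " + sum[1]);
        System.out.println("min over all 4 lanes:      " + minAllLanes(a));
        System.out.println("min over the 2 valid lanes: " + minValidLanes(a));
    }
}
```

In this model the lanewise add yields correct values in the two valid lanes regardless of what the upper lanes contain, while the full-width min picks up the stale lane. That is the kind of incorrect result the 64-bit minimum for reductions guards against.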

Main changes:

- Support vector types with 2 shorts. The supported min vector element count for each basic type is:
  - T_BOOLEAN: 2
  - T_BYTE/T_CHAR: 4
  - T_SHORT: 2 (newly supported)
  - T_INT/T_FLOAT/T_LONG/T_DOUBLE: 2
- Add codegen support for Vector[U]Cast with 32-bit input or output vector size. VectorReinterpret already handles the 32-bit vector size cases.
- Explicitly mark reductions with a vector size of less than 8 bytes as unsupported.
- Add additional IR tests for Vector API type conversions.
- Add JMH benchmark for auto-vectorization with two 16-bit lanes.
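
As a rough illustration of the minimum element counts listed above, the following Java sketch turns them into minimum vector sizes in bits. It is not the code in aarch64.ad: the enum, its byte sizes, and the method names are hypothetical, and only the element counts are taken from the list.

```java
class MinVectorSize {
    // Hypothetical stand-in for HotSpot's BasicType, with element size in bytes.
    enum BasicType {
        T_BOOLEAN(1), T_BYTE(1), T_CHAR(2), T_SHORT(2),
        T_INT(4), T_FLOAT(4), T_LONG(8), T_DOUBLE(8);

        final int bytes;

        BasicType(int bytes) {
            this.bytes = bytes;
        }
    }

    // Minimum element count per type, copied from the list in the description.
    static int minElements(BasicType bt) {
        switch (bt) {
            case T_BOOLEAN:
                return 2;
            case T_BYTE:
            case T_CHAR:
                return 4;
            case T_SHORT:
                return 2; // newly supported: 2 shorts = 32 bits
            default:
                return 2; // T_INT, T_FLOAT, T_LONG, T_DOUBLE
        }
    }

    // Minimum vector size in bits implied by the element count.
    static int minBits(BasicType bt) {
        return minElements(bt) * bt.bytes * 8;
    }

    public static void main(String[] args) {
        for (BasicType bt : BasicType.values()) {
            System.out.println(bt + ": " + minElements(bt) + " elements, " + minBits(bt) + " bits");
        }
    }
}
```

Under these assumptions T_SHORT drops to 32 bits, matching the existing T_BYTE minimum, while T_BOOLEAN stays at 16 bits.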

Test

Tested hotspot/jdk/langtools - all tests passed.

Performance

The following table shows the performance improvement of the relevant Vector API JMH benchmarks on an NVIDIA Grace (128-bit SVE2) machine:

Benchmark                                             SIZE   Mode  Unit   Before     After    Gain
VectorFPtoIntCastOperations.microDouble128ToShort128  512   thrpt ops/ms  731.529  26278.599  35.92
VectorFPtoIntCastOperations.microDouble128ToShort128  1024  thrpt ops/ms  366.461  10595.767  28.91
VectorFPtoIntCastOperations.microFloat64ToShort64     512   thrpt ops/ms  315.791  14327.682  45.37
VectorFPtoIntCastOperations.microFloat64ToShort64     1024  thrpt ops/ms  158.485   7261.847  45.82
VectorZeroExtend.short2Long                           128   thrpt ops/ms 1447.243 898666.972 620.95

And here is the performance improvement of the added JMH on Grace:

Benchmark                          LEN   Mode  Unit   Before    After   Gain
VectorTwoShorts.addVec2S           64    avgt  ns/op   20.948   12.683  1.65
VectorTwoShorts.addVec2S           128   avgt  ns/op   40.073   22.703  1.76
VectorTwoShorts.addVec2S           512   avgt  ns/op  157.447   83.691  1.88
VectorTwoShorts.addVec2S           1024  avgt  ns/op  313.022  165.085  1.89
VectorTwoShorts.mulVec2S           64    avgt  ns/op   20.981   12.647  1.65
VectorTwoShorts.mulVec2S           128   avgt  ns/op   40.279   22.637  1.77
VectorTwoShorts.mulVec2S           512   avgt  ns/op  158.642   83.371  1.90
VectorTwoShorts.mulVec2S           1024  avgt  ns/op  314.788  165.205  1.90
VectorTwoShorts.reverseBytesVec2S  64    avgt  ns/op   17.739    9.106  1.94
VectorTwoShorts.reverseBytesVec2S  128   avgt  ns/op   32.591   15.632  2.08
VectorTwoShorts.reverseBytesVec2S  512   avgt  ns/op  126.154   55.284  2.28
VectorTwoShorts.reverseBytesVec2S  1024  avgt  ns/op  254.592  107.457  2.36

We observe a similar uplift on an AArch64 N1 (NEON) machine.
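
For reading the Gain columns: the reported values are consistent with After/Before for the throughput table (thrpt, higher is better) and Before/After for the average-time table (avgt, lower is better). The sketch below recomputes two rows under that assumption; the class and method names are illustrative.

```java
import java.util.Locale;

class GainCheck {
    // Throughput mode: a larger "after" value is an improvement.
    static double throughputGain(double before, double after) {
        return after / before;
    }

    // Average-time mode: a smaller "after" value is an improvement.
    static double avgTimeGain(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // microDouble128ToShort128, SIZE 512 (ops/ms)
        System.out.println(String.format(Locale.ROOT, "%.2f", throughputGain(731.529, 26278.599)));
        // addVec2S, LEN 64 (ns/op)
        System.out.println(String.format(Locale.ROOT, "%.2f", avgTimeGain(20.948, 12.683)));
    }
}
```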

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Integration blocker

⚠️ Title mismatch between PR and JBS for issue JDK-8359419

Issue

JDK-8359419: AArch64: Support min vector size of 32-bit (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057
$ git checkout pull/26057

Update a local copy of the PR:
$ git checkout pull/26057
$ git pull https://git.openjdk.org/jdk.git pull/26057/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26057

View PR using the GUI difftool:
$ git pr show -t 26057

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26057.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-07-01T06:00:15Z

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-07-01T06:00:47Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-07-01T06:01:16Z

@XiaohongGong The following labels will be automatically applied to this pull request:

core-libs
hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-07-01T06:04:28Z

Webrevs

00: Full (5af5bd49)

theRealAph · 2025-07-01T08:10:16Z

src/hotspot/cpu/aarch64/aarch64.ad

+  int size;
+  switch(bt) {
+    case T_BOOLEAN:
+      // It needs to load/store a vector mask with only 2 elements

Suggested change

-      // It needs to load/store a vector mask with only 2 elements
+      // Load/store a vector mask with only 2 elements

Same with the other cases.

Thanks so much for your comment. I will fix them soon.

theRealAph · 2025-07-01T08:10:58Z

src/hotspot/cpu/aarch64/aarch64.ad

+      size = 2;
+      break;
+    default:
+      // Limit the min vector length to 64-bit normally.

Suggested change

-      // Limit the min vector length to 64-bit normally.
+      // Limit the min vector length to 64-bit.

theRealAph · 2025-07-01T08:11:19Z

src/hotspot/cpu/aarch64/aarch64_vector.ad

+      case Op_MinReductionV:
+      case Op_MaxReductionV:
+        // Reductions with less than 8 bytes vector length are
+        // not supported for now.

Suggested change

-        // not supported for now.
+        // not supported.

8359419: AArch64: Relax min vector length to 32-bit for short vectors

5af5bd4

openjdk bot added the rfr Pull request is ready for review label Jul 1, 2025

openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jul 1, 2025

theRealAph reviewed Jul 1, 2025
