
Zen3 scheduler model for the latency of VEXTRACTF128rri is probably incorrect #146564

Open
@TiborGY


See also discussion at https://discourse.llvm.org/t/are-the-latencies-of-vextractf128-correct-for-zen2-3-in-mca/86422

LLVM MCA relies on LLVM's scheduler models to predict cycle counts. This is the predicted timeline graph for a small snippet on Zen3:

[0,0]     DeeeeeeeeER    .    .   vmovapd       (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER .    .   vsubpd        (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER   .   vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER   vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps      %xmm0, %xmm0, %xmm2

As you can see, vextractf128 is predicted to have 4 cycles of latency. This, however, is inconsistent with both Agner Fog's latency tables (which list 3 cycles) and my own measurements with llvm-exegesis:

./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000 -min-instructions=1000  --repetition-mode=loop
---
mode:            latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM0 YMM0 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM0=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 1000
measurements:
  - { key: latency, value: 3.15, per_snippet_value: 3.15, validation_counters: {} }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F04244883C42049B80200000000000000662E0F1F840000000000C4E37D19C001C4E37D19C0014983C0FF75EEC3
...
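
The comparison between the two outputs above can be sketched in a few lines of Python. This is only an illustration, not part of either tool: the function names `modeled_latencies` and `measured_latency` are hypothetical, and it assumes (as this report does) that the number of `e` characters in a timeline row is the latency the scheduler model assigns to that instruction.

```python
import re

# Timeline rows copied from the llvm-mca output above.
TIMELINE = """\
[0,2]     D===========eeeER   .   vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER   vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps      %xmm0, %xmm0, %xmm2
"""

# Measurement line copied from the llvm-exegesis report above.
EXEGESIS = "  - { key: latency, value: 3.15, per_snippet_value: 3.15, validation_counters: {} }"

# D = dispatched, '=' = waiting, 'e' = executing, E = executed,
# '-' = waiting to retire, R = retired. The mnemonic follows the timeline.
ROW = re.compile(r"^\[\d+,\d+\]\s+D=*(e*)E-*R[\s.]*(\S+)")


def modeled_latencies(timeline):
    """Map each mnemonic to the number of 'e' cycles in its timeline row.

    Assumption: the count of 'e' characters equals the modeled latency.
    """
    result = {}
    for line in timeline.splitlines():
        match = ROW.match(line)
        if match:
            result[match.group(2)] = len(match.group(1))
    return result


def measured_latency(report):
    """Extract the latency value from an llvm-exegesis report, or None."""
    match = re.search(r"key:\s*latency,\s*value:\s*([0-9.]+)", report)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    modeled = modeled_latencies(TIMELINE)["vextractf128"]
    measured = measured_latency(EXEGESIS)
    print(f"modeled: {modeled} cycles, measured: {measured} cycles, "
          f"difference: {modeled - round(measured)} cycle(s)")
```

The measured value is rounded to the nearest whole cycle before comparing, since the fractional part of 3.15 is presumably loop overhead or measurement noise rather than real latency.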

Confusingly, AMD's official instruction latency table for Zen3 (Family_19h_Instruction_Latencies_version_1-00.xlsx, AMD Publication No. 56665 Revision 3.00 November 2020) lists vextractf128 as having 4 cycles of latency. Perhaps I am misinterpreting my measurement results, but I cannot see how that figure could be correct. My confidence in the accuracy of the official latency table is further eroded by the fact that the two vextractf128 variants are both listed with empty operand fields.
