More register used when multiple target regions are compiled together #24

ye-luo · 2019-08-03T22:11:32Z

The source code I'm using has multiple offload regions in different member functions of a class.
If I enable each target region individually and comment out the other target pragma, I get the following.
Kernel 1 only

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 158

Kernel 2 only

      NumSGPRs:        86
      NumVGPRs:        164

If I enable both offload regions:
Kernel 1

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 160

Kernel 2

      NumSGPRs:        86
      NumVGPRs:        256
      NumSpilledVGPRs: 160

The number of vector registers needed (allocated plus spilled) is larger than when each kernel is compiled on its own.
Both kernels are compiled from independent target regions, so this behaviour seems very strange.
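
To make the comparison concrete, here is a small sketch that adds allocated and spilled VGPRs for each kernel, using only the figures quoted above. Using that sum as a proxy for register demand, and reading a missing NumSpilledVGPRs line as zero spills, are both assumptions; the helper names are made up for illustration.

```python
# Figures copied from the comment above as (NumVGPRs, NumSpilledVGPRs).
# Assumption: a missing NumSpilledVGPRs line means 0 spilled registers.
standalone = {"kernel 1": (256, 158), "kernel 2": (164, 0)}
together = {"kernel 1": (256, 160), "kernel 2": (256, 160)}


def demand(vgprs, spilled):
    """Rough proxy for vector register demand: allocated plus spilled VGPRs."""
    return vgprs + spilled


def demand_increase(before, after):
    """Per-kernel change in the proxy between two builds."""
    return {name: demand(*after[name]) - demand(*before[name]) for name in before}


if __name__ == "__main__":
    for name, delta in demand_increase(standalone, together).items():
        print(f"{name}: {delta:+d} VGPRs (allocated + spilled)")
```

By this measure kernel 1 barely changes (414 to 416), while kernel 2 goes from 164 to 416.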

gregrodgers · 2019-08-05T15:49:59Z

This is good information. I would like to know what optimization level, if any, you requested. Can you attach your source and command line? Thank you.

ye-luo · 2019-08-06T13:38:01Z

reproducer

git clone https://github.com/ye-luo/miniqmc
cd miniqmc/build
cmake -DCMAKE_CXX_COMPILER=/home/yeluo/rocm/aomp_0.7-0/bin/clang++ \
-DENABLE_OFFLOAD=1 -DOFFLOAD_TARGET=amdgcn-amd-amdhsa \
-DCMAKE_CXX_FLAGS="-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -v" \
..
make -j15 check_spo_batched

In src/QMCWaveFunctions/einspline_spo_omp.cpp,
lines 159, 238, 311 and 405 have offload regions for heavy computation.
With all of them enabled, the kernel at line 311 has

    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 1024
      PrivateSegmentFixedSize: 872
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        92
      NumVGPRs:        256
      MaxFlatWorkGroupSize: 256
      NumSpilledVGPRs: 375

Now comment out the #pragma omp lines at 149, 238 and 405, but leave the one at 311, and rebuild.

make -j15 check_spo_batched
    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 766
      PrivateSegmentFixedSize: 48
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        88
      NumVGPRs:        250
      MaxFlatWorkGroupSize: 256

NumVGPRs drops and there is no spilling.
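
Comparing these blocks by eye is error-prone. Below is a minimal sketch that parses the two CodeProps blocks quoted above and lists only the fields that differ. The helper names are hypothetical (not part of AOMP), and treating a missing field such as NumSpilledVGPRs as 0 is an assumption.

```python
import re

# The two CodeProps blocks quoted above, for the kernel at line 311.
ALL_REGIONS = """
    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 1024
      PrivateSegmentFixedSize: 872
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        92
      NumVGPRs:        256
      MaxFlatWorkGroupSize: 256
      NumSpilledVGPRs: 375
"""

ONLY_311 = """
    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 766
      PrivateSegmentFixedSize: 48
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        88
      NumVGPRs:        250
      MaxFlatWorkGroupSize: 256
"""


def parse_code_props(text):
    """Collect 'Key: integer' lines into a dict; other lines are ignored."""
    props = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+):\s*(\d+)\s*$", line)
        if m:
            props[m.group(1)] = int(m.group(2))
    return props


def diff_props(a, b):
    """Return {field: (a_value, b_value)} for fields that differ.

    Assumption: a field missing from one block counts as 0.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k, 0), b.get(k, 0)) for k in keys if a.get(k, 0) != b.get(k, 0)}


if __name__ == "__main__":
    changed = diff_props(parse_code_props(ALL_REGIONS), parse_code_props(ONLY_311))
    for field, (with_all, alone) in changed.items():
        print(f"{field}: {with_all} -> {alone}")
```

Fields that are identical in both builds, such as WavefrontSize, are left out of the result.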

As another test, I add an empty target region right before line 311:

#pragma omp target
{ }

The newly added kernel has

    CodeProps:
      KernargSegmentSize: 0
      GroupSegmentFixedSize: 754
      PrivateSegmentFixedSize: 0
      KernargSegmentAlign: 4
      WavefrontSize:   64
      NumSGPRs:        40
      NumVGPRs:        248
      MaxFlatWorkGroupSize: 256

All the numbers are significantly larger than the numbers given when the empty offload region is compiled standalone.

JonChesterfield · 2020-08-14T17:12:14Z

-v doesn't produce this output anymore. A potentially useful alternative is -mllvm -amdgpu-dump-hsa-metadata, which produces YAML output like:

AMDGPU HSA Metadata:
---
amdhsa.kernels:
  - .args:
      - .address_space:  generic
        .name:           isHost
        .offset:         0
        .size:           8
        .value_kind:     global_buffer
    .group_segment_fixed_size: 915
    .kernarg_segment_align: 8
    .kernarg_segment_size: 8
    .language:       OpenCL C
    .language_version:
      - 2
      - 0
    .max_flat_workgroup_size: 256
    .name:           __omp_offloading_fd00_261b86_kernel_l7
    .private_segment_fixed_size: 0
    .sgpr_count:     25
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_fd00_261b86_kernel_l7.kd
    .vgpr_count:     22
    .vgpr_spill_count: 0
    .wavefront_size: 64

Alternatively, the same metadata is available by reading the msgpack data from the shared library (ELF) containing the device code.
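
For the text dump, a rough sketch of pulling the per-kernel register counts out of that output, however it was captured. It assumes exactly the layout shown above (each kernel starts with a two-space-indented "- " item and kernel-level fields sit at four spaces); a real YAML parser would be more robust. The function name and the trimmed SAMPLE are illustrative only.

```python
import re

# Kernel-level fields: four spaces of indent, or the first field after "  - ".
# Argument fields are indented deeper and therefore do not match.
KERNEL_FIELD = re.compile(r"^(?:    |  - )(\.\w+):\s*(.*)$")

COUNTERS = (".sgpr_count", ".sgpr_spill_count", ".vgpr_count", ".vgpr_spill_count")

# Trimmed copy of the dump quoted above.
SAMPLE = """\
amdhsa.kernels:
  - .args:
      - .address_space:  generic
        .name:           isHost
        .offset:         0
        .size:           8
        .value_kind:     global_buffer
    .group_segment_fixed_size: 915
    .name:           __omp_offloading_fd00_261b86_kernel_l7
    .private_segment_fixed_size: 0
    .sgpr_count:     25
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_fd00_261b86_kernel_l7.kd
    .vgpr_count:     22
    .vgpr_spill_count: 0
    .wavefront_size: 64
"""


def kernel_register_usage(dump):
    """Map each kernel name to its SGPR/VGPR counts and spill counts."""
    kernels = []
    current = None
    for line in dump.splitlines():
        if line.startswith("  - "):
            # A new entry in the amdhsa.kernels list begins here.
            current = {}
            kernels.append(current)
        m = KERNEL_FIELD.match(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2).strip()
    return {
        k[".name"]: {f: int(k[f]) for f in COUNTERS if f in k}
        for k in kernels
        if ".name" in k
    }


if __name__ == "__main__":
    for name, counts in kernel_register_usage(SAMPLE).items():
        print(name, counts)
```

Running this over the dumps from two builds gives per-kernel numbers that can be compared directly.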

ronlieb · 2023-11-21T16:21:00Z

Hi Ye, this one is over 3 years old, so I am closing it. If it is still an issue, please reopen it or open a new issue.

ronlieb mentioned this issue Sep 3, 2019

isa<X>(Val) && "cast<Ty>() argument of incompatible type! failed for target=amdgcn-amd-amdhsa -march=gfx900 #29

Closed

yhmtsai mentioned this issue Apr 15, 2020

pragma unroll warning in hip ginkgo-project/ginkgo#492

Merged

gregrodgers self-assigned this May 19, 2020

ronlieb mentioned this issue Jul 20, 2020

AOMP failure when compiling programs for NVIDIA GPUs #118

Closed

ronlieb mentioned this issue Aug 14, 2020

Kokkos: Getting ICE on ItaniumMangleContextImpl::mangleCXXName #79

Closed

ronlieb closed this as completed Nov 21, 2023
