Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there any documentation on address mapping to help developers use it? #13159

Open
wangzy0327 opened this issue Mar 26, 2024 · 8 comments
Open
Assignees
Labels
enhancement New feature or request

Comments

@wangzy0327
Copy link

Is your feature request related to a problem? Please describe

It is planned to expand new hardware based on SYCL. No relevant guidance has been found regarding the development of the address mapping part.Can you provide instructions or documents on address mapping for developers to refer to? This is the code part for the relevant address mapping based on 2022-06 version.What is the meaning of the contents of the NVPTXAddrSpaceMap variable? Which source files are involved in the relevant address space and the APIs called? @AlexeySachkov @elizabethandrews

Can you give me some help?

Describe the solution you would like

It is planned to expand new hardware based on SYCL about device memory access development.

Describe alternatives you have considered

No response

Additional context

No response

@wangzy0327 wangzy0327 added the enhancement New feature or request label Mar 26, 2024
@KornevNikita
Copy link
Contributor

@wangzy0327 hi, did Arvind's answer help you?

@wangzy0327
Copy link
Author

@wangzy0327 hi, did Arvind's answer help you?
When I was extending the SYCL code, I encountered the following error. It looks like an address space mapping problem. Can you give me some suggestions? How to analyze or debug?

@KornevNikita

The error line is PI_CHECK_ERROR(cnQueueSync(s));
pi_cnrt.cpp

pi_result cnrt_piQueueRelease(pi_queue command_queue) {
  assert(command_queue != nullptr);

  if (command_queue->decrement_reference_count() > 0) {
    return PI_SUCCESS;
  }

  try {
    std::unique_ptr<_pi_queue> queueImpl(command_queue);

    ScopedContext active(command_queue->get_context());

    command_queue->for_each_queue([](CNqueue s) {
      PI_CHECK_ERROR(cnQueueSync(s));
      PI_CHECK_ERROR(cnDestroyQueue(s));
    });

    return PI_SUCCESS;
  } catch (pi_result err) {
    return err;
  } catch (...) {
    return PI_ERROR_OUT_OF_RESOURCES;
  }
}

This is test program about the device extend.
test_demo.cpp

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <sys/time.h>
using namespace sycl;

constexpr int N = 256;

long long getTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (tv.tv_sec*1000000 + tv.tv_usec);
}

int main(){
    sycl::queue q;
    auto dev = q.get_device();
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    float *c_host = (float *)malloc(sizeof(float) * N);

    for(int i = 0;i < N;i++){
        a[i] = 0.5f;b[i] = 0.5f;c[i] = 0.0f;c_host[i] = 1.0f;
    }

    range<1> arr_range(N);

    sycl::buffer<float,1> bufferA((float*)a,arr_range);
    sycl::buffer<float,1> bufferB((float*)b,arr_range);
    sycl::buffer<float,1> bufferC((float*)c,arr_range);

    auto startTime = getTime();
    q.submit([&](handler &h){
        sycl::accessor aA{bufferA,h,read_only};
        sycl::accessor aB{bufferB,h,read_only};
        sycl::accessor aC{bufferC,h,write_only};
        sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local> localAccA(N,h);
        sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local> localAccB(N,h);

        h.parallel_for<>(1,[=](sycl::id<1> i){
            for(int j = 0;j < N;j++){
                localAccA[j] = aA[j];
                localAccB[j] = aB[j];
                aC[j] = localAccA[j] + localAccB[j];
            }
        });

    });
    sycl::host_accessor host_accC(bufferC,read_only);
    std::cout << "Result: " << host_accC[0] << " .. " << host_accC[N - 1] << std::endl;    
    auto endTime = getTime();
    std::cout << "Time : " << endTime - startTime <<" us "<< std::endl;
    free(a);
    free(b);
    free(c);
    free(c_host);
    return 0;

@asudarsa
Copy link
Contributor

asudarsa commented Apr 4, 2024

Hi @wangzy0327

I tried to compile your code using 'clang++ -fsycl test.cpp'. Hope that is the right way. I ran into a few issues. When I looked closer at your code, I saw a few issues:

  1. h.parallel_for<>(1,[=](sycl::id<1> i){ ...} --> 'i' is not used inside the kernel
  2. According to SYCL 2020 doc, local accessors cannot be used in the parallel_for overloaded version you are using.
    Please look at 4.9.4.2.2. parallel_for invoke for details.

Thanks

@asudarsa asudarsa self-assigned this Apr 4, 2024
@wangzy0327
Copy link
Author

wangzy0327 commented Apr 14, 2024

I tried to compile the above sample code using the cuda version and extended hardware version of sycl released in 2022-06. The device-side llvm ir code compiled by sycl-cuda is as follows.

simple-add-sm_70.ll
; Function Attrs: noinline norecurse
define weak_odr dso_local void @_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_EUlNS0_2idILi1EEEE_(float addrspace(3)* noundef align 4 %_arg_localAccA, float add
rspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !24 !kernel_arg_runtime_aligned !25 !kernel_arg_exclusive_ptr !25 !sycl_kernel_omit_args !26 {entry:
  %0 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0
  %1 = load i64, i64* %0, align 8
  %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %1
  %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0
  %3 = load i64, i64* %2, align 8
  %add.ptr.i41 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %3
  %4 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4
  %conv.i.i.i.i.i.i.i = sext i32 %4 to i64
  %5 = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #4
  %conv.i1.i.i.i.i.i.i = sext i32 %5 to i64
  %mul.i.i.i.i.i.i = mul nsw i64 %conv.i1.i.i.i.i.i.i, %conv.i.i.i.i.i.i.i
  %6 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #4
  %conv.i2.i.i.i.i.i.i = sext i32 %6 to i64
  %add.i.i.i.i.i.i = add nsw i64 %mul.i.i.i.i.i.i, %conv.i2.i.i.i.i.i.i
  %7 = tail call i32* @llvm.nvvm.implicit.offset() #4
  %8 = load i32, i32* %7, align 4, !tbaa !14
  %conv.i3.i.i.i.i.i.i = zext i32 %8 to i64
  %add4.i.i.i.i.i.i = add nsw i64 %add.i.i.i.i.i.i, %conv.i3.i.i.i.i.i.i
  %cmp.i.i.i = icmp ult i64 %add4.i.i.i.i.i.i, 2147483648
  tail call void @llvm.assume(i1 %cmp.i.i.i) #4
  br label %for.body.i

for.body.i:                                       ; preds = %for.body.i, %entry
  %j.015.i = phi i32 [ 0, %entry ], [ %inc.i.1, %for.body.i ]
  %conv.i = zext i32 %j.015.i to i64
  %arrayidx.i.i42 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %conv.i
  %arrayidx.ascast.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i42 to float*
  %9 = load float, float* %arrayidx.ascast.i.i, align 4, !tbaa !18
  %arrayidx.i3.i = getelementptr inbounds float, float addrspace(3)* %_arg_localAccA, i64 %conv.i
  %arrayidx.ascast.i4.i = addrspacecast float addrspace(3)* %arrayidx.i3.i to float*
  store float %9, float* %arrayidx.ascast.i4.i, align 4, !tbaa !18
  %add.i = fadd float %9, 5.000000e-01
  %arrayidx.i11.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %conv.i
  %arrayidx.ascast.i12.i = addrspacecast float addrspace(1)* %arrayidx.i11.i to float*
  store float %add.i, float* %arrayidx.ascast.i12.i, align 4, !tbaa !18
  %inc.i = or i32 %j.015.i, 1
  %conv.i.1 = zext i32 %inc.i to i64
  %arrayidx.i.i42.1 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %conv.i.1
  %arrayidx.ascast.i.i.1 = addrspacecast float addrspace(1)* %arrayidx.i.i42.1 to float*
  %10 = load float, float* %arrayidx.ascast.i.i.1, align 4, !tbaa !18
  %arrayidx.i3.i.1 = getelementptr inbounds float, float addrspace(3)* %_arg_localAccA, i64 %conv.i.1
  %arrayidx.ascast.i4.i.1 = addrspacecast float addrspace(3)* %arrayidx.i3.i.1 to float*
  store float %10, float* %arrayidx.ascast.i4.i.1, align 4, !tbaa !18
  %add.i.1 = fadd float %10, 5.000000e-01
  %arrayidx.i11.i.1 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %conv.i.1
  %arrayidx.ascast.i12.i.1 = addrspacecast float addrspace(1)* %arrayidx.i11.i.1 to float*
  store float %add.i.1, float* %arrayidx.ascast.i12.i.1, align 4, !tbaa !18
  %inc.i.1 = add nuw nsw i32 %j.015.i, 2
  %exitcond.not.i.1 = icmp eq i32 %inc.i.1, 256
  br i1 %exitcond.not.i.1, label %_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit, label %for.body.i, !llvm.loop !22

_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit: ; preds = %for.body.i
  ret void

; uselistorder directives
  uselistorder float addrspace(3)* %_arg_localAccA, { 1, 0 }
  uselistorder i32 %j.015.i, { 1, 0, 2 }
  uselistorder i32 %inc.i.1, { 1, 0 }
}

The device-side llvm ir code compiled by the extended hardware is as follows.

simple-add-mtp_372.ll
; Function Attrs: convergent noinline norecurse
define weak_odr dso_local void @_ZTSN2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE(%"cl
ass.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_NumWorkItems, float addrspace(101)* noundef align 4 %_arg_localAccA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA1, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA2, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_localAccA3, float addrspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA4, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA5, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC7, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC8, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !9 !kernel_arg_runtime_aligned !10 !kernel_arg_exclusive_ptr !10 {entry:
  %0 = getelementptr inbounds %"class.cl::sycl::range", %"class.cl::sycl::range"* %_arg_NumWorkItems, i64 0, i32 0, i32 0, i64 0
  %1 = load i64, i64* %0, align 8
  %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0
  %3 = load i64, i64* %2, align 8
  %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %3
  %4 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0
  %5 = load i64, i64* %4, align 8
  %add.ptr.i44 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %5
  %6 = tail call i32 @llvm.mlvm.read.mlu.sreg.taskidx() #5
  %conv.i.i.i.i.i.i = sext i32 %6 to i64
  %call.i.i.i.i.i.i = tail call i64 @_Z23__spirv_NumWorkgroups_xv() #6
  %call1.i.i.i.i.i.i = tail call i64 @_Z23__spirv_WorkgroupSize_xv() #6
  %call.i.i.i.i.i = tail call noundef i64 @_Z22__spirv_GlobalOffset_xv() #7
  %cmp.i.i = icmp sgt i32 %6, -1
  tail call void @llvm.assume(i1 %cmp.i.i) #5
  %cmp.not.i = icmp ugt i64 %1, %conv.i.i.i.i.i.i
  br i1 %cmp.not.i, label %for.body.i.i, label %_ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idI
Li1EEEE_EclES4_.exit
for.body.i.i:                                     ; preds = %entry, %for.body.i.i
  %indvars.iv.i.i = phi i64 [ %indvars.iv.next.i.i, %for.body.i.i ], [ 0, %entry ]
  %arrayidx.i.i6.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %indvars.iv.i.i
  %arrayidx.ascast.i.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i6.i to float*
  %7 = load float, float* %arrayidx.ascast.i.i.i, align 4, !tbaa !11
  %arrayidx.i3.i.i = getelementptr inbounds float, float addrspace(101)* %_arg_localAccA, i64 %indvars.iv.i.i
  %arrayidx.ascast.i4.i.i = addrspacecast float addrspace(101)* %arrayidx.i3.i.i to float*
  store float %7, float* %arrayidx.ascast.i4.i.i, align 4, !tbaa !11
  %add.i.i = fadd float %7, 5.000000e-01
  %arrayidx.i11.i.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i44, i64 %indvars.iv.i.i
  %arrayidx.ascast.i12.i.i = addrspacecast float addrspace(1)* %arrayidx.i11.i.i to float*
  store float %add.i.i, float* %arrayidx.ascast.i12.i.i, align 4, !tbaa !11
  %indvars.iv.next.i.i = add nuw nsw i64 %indvars.iv.i.i, 1
  %exitcond.not.i.i = icmp eq i64 %indvars.iv.next.i.i, 256
  br i1 %exitcond.not.i.i, label %_ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4
_.exit, label %for.body.i.i, !llvm.loop !15
_ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4_.exit: ; preds = %for.body.i.i, %
entry  ret void

; uselistorder directives
  uselistorder label %for.body.i.i, { 1, 0 }
  uselistorder i64 %indvars.iv.next.i.i, { 1, 0 }
}

; Function Attrs: inaccessiblememonly mustprogress nocallback nofree nosync nounwind willreturn
declare void @llvm.assume(i1 noundef) #1

; Function Attrs: convergent
declare dso_local noundef i64 @_Z22__spirv_GlobalOffset_xv() local_unnamed_addr #2

; Function Attrs: convergent noinline norecurse
define weak_odr dso_local void @_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_EUlNS0_2idILi1EEEE_(float addrspace(101)* noundef align 4 %_arg_localAccA, %"class
.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA1, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA2, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_localAccA3, float addrspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA4, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA5, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC7, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC8, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !17 !kernel_arg_runtime_aligned !18 !kernel_arg_exclusive_ptr !18 {entry:
  %0 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0
  %1 = load i64, i64* %0, align 8
  %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %1
  %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0
  %3 = load i64, i64* %2, align 8
  %add.ptr.i41 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %3
  %4 = tail call i32 @llvm.mlvm.read.mlu.sreg.taskidx() #5
  %call.i.i.i.i.i.i = tail call i64 @_Z23__spirv_NumWorkgroups_xv() #6
  %call1.i.i.i.i.i.i = tail call i64 @_Z23__spirv_WorkgroupSize_xv() #6
  %call.i.i.i.i.i = tail call noundef i64 @_Z22__spirv_GlobalOffset_xv() #7
  %cmp.i.i = icmp sgt i32 %4, -1
  tail call void @llvm.assume(i1 %cmp.i.i) #5
  br label %for.body.i

for.body.i:                                       ; preds = %for.body.i, %entry
  %indvars.iv.i = phi i64 [ 0, %entry ], [ %indvars.iv.next.i, %for.body.i ]
  %arrayidx.i.i42 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %indvars.iv.i
  %arrayidx.ascast.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i42 to float*
  %5 = load float, float* %arrayidx.ascast.i.i, align 4, !tbaa !11
  %arrayidx.i3.i = getelementptr inbounds float, float addrspace(101)* %_arg_localAccA, i64 %indvars.iv.i
  %arrayidx.ascast.i4.i = addrspacecast float addrspace(101)* %arrayidx.i3.i to float*
  store float %5, float* %arrayidx.ascast.i4.i, align 4, !tbaa !11
  %add.i = fadd float %5, 5.000000e-01
  %arrayidx.i11.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %indvars.iv.i
  %arrayidx.ascast.i12.i = addrspacecast float addrspace(1)* %arrayidx.i11.i to float*
  store float %add.i, float* %arrayidx.ascast.i12.i, align 4, !tbaa !11
  %indvars.iv.next.i = add nuw nsw i64 %indvars.iv.i, 1
  %exitcond.not.i = icmp eq i64 %indvars.iv.next.i, 256
  br i1 %exitcond.not.i, label %_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit, label %for.body.i, !llvm.loop !15

_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit: ; preds = %for.body.i
  ret void

; uselistorder directives
  uselistorder i64 %indvars.iv.next.i, { 1, 0 }
}

It is found that the handler of the extended hardware does not have the address 1 address number. How to fix this problem? How are the variable parameters of address 1 address defined and used? @KornevNikita @sommerlukas @elizabethandrews
reference to source code (clang/lib/Basic/Targets/NVPTX.h)NVPTXAddrSpaceMap

Copy link
Contributor

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

  • provide an update if you have any
  • unassign yourself if you're not looking / going to look into this issue
  • mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it
  • close the issue if it has been resolved
  • take any other suitable action.

Thanks!

@wangzy0327
Copy link
Author

How to develop the address space mapping for expanding new hardware? Can you give some specific suggestions and guidance? @asudarsa

Copy link
Contributor

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

  • provide an update if you have any
  • unassign yourself if you're not looking / going to look into this issue
  • mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it
  • close the issue if it has been resolved
  • take any other suitable action.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants