CUDA compile error on SharedToGlobal1D #1388

Closed
edopao opened this issue Oct 9, 2023 · 0 comments · Fixed by #1442

edopao commented Oct 9, 2023

Describe the bug
The CUDA code generated for the attached SDFG cannot be compiled:

    .dacecache/calculate_nabla2_for_w_gpu/src/cuda/calculate_nabla2_for_w_gpu_cuda.cu(95): error: too many arguments for class template "dace::SharedToGlobal1D"

The generated code around line 95:

    91                     dace::wcr_fixed<dace::ReductionType::Sum, double>::reduce_atomic(__var_174, *(&__var_228));
    92                 }
    93             }
    94         }
    95         dace::SharedToGlobal1D<double, 4, 1, 1, 1, 1, true>(__var_174, 1, __var_230);
    96 
    97     }

The problem disappears if I enable the `SharedToGlobal1D` template in `copy.cuh`, which is currently commented out:

    /*
    template <typename T, int BLOCK_WIDTH, int BLOCK_HEIGHT, int BLOCK_DEPTH,
        int COPY_XLEN, int DST_XSTRIDE,
        bool ASYNC>
        static DACE_DFI void SharedToGlobal1D(
            const T *smem, int src_xstride, T *ptr)
    {
        GlobalToShared3D<T, BLOCK_WIDTH, BLOCK_HEIGHT, BLOCK_DEPTH, 1,
            1, COPY_XLEN, 1, 1, DST_XSTRIDE, ASYNC>(
                smem, 1, 1, src_xstride, ptr);
    }
    */
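To check that the failing call lines up with this commented-out template, its template arguments can be split out and paired with the template's parameter list. The sketch below uses a hypothetical helper (`template_args` is not part of DaCe) with a naive regex that assumes no nested `<...>` inside the argument list, which holds for the call at line 95.

```python
import re

# Template parameters of the commented-out SharedToGlobal1D in copy.cuh (T plus six).
PARAMS = ("T", "BLOCK_WIDTH", "BLOCK_HEIGHT", "BLOCK_DEPTH",
          "COPY_XLEN", "DST_XSTRIDE", "ASYNC")


def template_args(code, name="dace::SharedToGlobal1D"):
    """Hypothetical helper: return the template arguments of the first
    `name<...>` found in `code`, split naively on commas.

    Assumes the argument list contains no nested angle brackets.
    """
    match = re.search(re.escape(name) + r"<([^<>]*)>", code)
    if match is None:
        raise ValueError(f"no {name}<...> found")
    return [arg.strip() for arg in match.group(1).split(",")]


# The call emitted at line 95 of the generated .cu file.
line = "dace::SharedToGlobal1D<double, 4, 1, 1, 1, 1, true>(__var_174, 1, __var_230);"
args = template_args(line)
print(len(args), len(PARAMS))  # 7 7
for param, arg in zip(PARAMS, args):
    print(f"{param} = {arg}")
```

On the full generated file, iterating with `re.finditer` instead of `re.search` would list every emitted `SharedToGlobal1D` call.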

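So it seems to me that the CUDA lowering does not use the right template: the generated call passes seven template arguments (`double, 4, 1, 1, 1, 1, true`), which match the parameter list of the commented-out function template but not the class template that `copy.cuh` currently defines.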
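If the parameter names mean what they suggest, the commented-out helper (which forwards to `GlobalToShared3D` with what appear to be unit extents in the other two dimensions) performs a strided 1D copy of `COPY_XLEN` elements from shared memory to global memory: a plain write that overwrites the destination rather than reducing into it. The following sequential host-side Python model is a sketch under that assumption. The function name `shared_to_global_1d` is hypothetical, and the model ignores `BLOCK_WIDTH`/`BLOCK_HEIGHT`/`BLOCK_DEPTH` and `ASYNC`, which on the GPU presumably govern how threads split the work and how the copy is synchronized.

```python
def shared_to_global_1d(smem, src_xstride, dst, copy_xlen, dst_xstride):
    """Hypothetical sequential model of a strided 1D shared-to-global copy.

    Element i is read from smem[i * src_xstride] and written to
    dst[i * dst_xstride]; the destination is overwritten, not accumulated.
    """
    for i in range(copy_xlen):
        dst[i * dst_xstride] = smem[i * src_xstride]


# Values from the failing call: COPY_XLEN = 1, DST_XSTRIDE = 1, src_xstride = 1.
single = [0.0]
shared_to_global_1d([3.5], 1, single, copy_xlen=1, dst_xstride=1)
print(single)  # [3.5]

# A wider example: 3 elements, source stride 2, destination stride 1.
smem = [1.0, -1.0, 2.0, -1.0, 3.0]
dst = [0.0] * 3
shared_to_global_1d(smem, 2, dst, copy_xlen=3, dst_xstride=1)
print(dst)  # [1.0, 2.0, 3.0]
```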

To Reproduce
Load the attached SDFG and compile it with the following program:

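    import dace
    import os

    run_on_gpu = True
    sdfg_name = "calculate_nabla2_for_w_gpu.sdfg"
    path = os.path.join(os.getcwd(), sdfg_name)

    sdfg = dace.SDFG.from_file(path)

    if run_on_gpu:
        device = dace.DeviceType.GPU  # not used further in this snippet
        # Append "_gpu" to the SDFG name (this also names the build folder in .dacecache).
        sdfg._name = f"{sdfg.name}_gpu"
        # Place all non-transient (input/output) arrays in GPU global memory.
        for _, _, array in sdfg.arrays_recursive():
            if not array.transient:
                array.storage = dace.dtypes.StorageType.GPU_Global
    else:
        device = dace.DeviceType.CPU

    # Fails while compiling the generated CUDA code (see the error above).
    sdfg.compile(validate=True)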

Attachment: sdfg.zip

edopao self-assigned this Nov 22, 2023
edopao linked a pull request Nov 22, 2023 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2023
This PR addresses #1388: fix python codegen and `SharedToGlobal1D`
template to generate correct code for write without reduction.