GPU performance issues #794
Comments

- It is indeed dispatched to a specific kernel in CUDA (this one).
- The comment suggests the boundscheck is expensive, can you try putting …
- I tried that, but it didn't change anything.
- One thing I would first do is to use Nsight Systems to profile your application: https://cuda.juliagpu.org/stable/development/profiling/#NVIDIA-Nsight-Systems
- You likely need …
- Did you put the …
- Yeah, sorry, I wasn't being very explicit. I tried to put the …
- No actual work is being done by the …
- Yes, I used both but that didn't change the timers. I'm currently looking into the source code of TimerOutputs, because I'm starting to doubt the time it's measuring...
- Ah yes sorry, the bounds check is indeed done within the view: …
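To follow the Nsight Systems suggestion above, one option (a sketch; the exact behaviour depends on the CUDA.jl version, see the linked documentation) is to wrap the region of interest in `CUDA.@profile`:

```julia
using CUDA

# Depending on the CUDA.jl version, this either delimits the region for an external
# profiler such as Nsight Systems or runs CUDA.jl's integrated profiler and prints a
# summary of the kernels launched inside the block.
CUDA.@profile begin
    X = CUDA.randn(Float32, 84751, 194)
    sum(abs2, X; dims=1)     # stand-in for the real SCF / LOBPCG workload
end
```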
I've noted some performance issues when doing GPU computations with DFTK. I'm opening this issue to gather ideas, because so far I haven't managed to solve them (perhaps @vchuravy can help?).
I launched the SCF solver for a model with a supercell of parameters `(4, 5, 2)`. I am using `Float32`. On my computer, this takes about 22 seconds, and the timer for the LOBPCG routine shows that a large part of that time goes into `drop!`.

The current implementation of `drop!` is actually quite bad for GPUs: it launches a kernel for each column of `X`, so it is no surprise that a lot of time is spent in this loop. `drop!` happens very often (more than 130 times in this case); however, actually having to drop a column by randomising it is rather uncommon. `X` is a very "skinny" matrix: in this case, its size is around (84751, 194).
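A minimal sketch of the per-column pattern described above (an illustration only, not DFTK's actual `drop!`; the tolerance handling and the re-randomisation are assumptions):

```julia
using LinearAlgebra, Random

# Per-column drop!: on a CuArray, every norm(view(X, :, j)) is a separate reduction
# kernel launch, so the loop pays one launch per column even when no column ever
# needs to be dropped.
function drop_per_column!(X::AbstractMatrix{T}; tol=sqrt(eps(real(T)))) where {T}
    for j in axes(X, 2)
        if norm(view(X, :, j)) < tol   # one GPU kernel per column
            randn!(view(X, :, j))      # re-randomise the (near-)zero column
        end
    end
    X
end
```

On the CPU this loop is harmless; on the GPU the cost comes largely from the number of kernel launches.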
My idea was to vectorize the norm computation; however, doing this also forces us to bring the norms array back to the CPU, and we barely gain any computational time. Instead, I would like to also vectorize the tolerance check (using `any` or `mapreduce`), and only do the GPU -> CPU transfer if it is actually needed; my actual code follows this idea.
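A hedged sketch of what such a vectorized check can look like (the function name, `X_norms`, `tol`, and the fallback loop are assumptions, not the exact code from this issue):

```julia
using CUDA, Random

# Compute all column norms in one fused GPU reduction, do the tolerance check on the
# GPU, and only transfer to the CPU in the (rare) case that a column has to be dropped.
function drop_vectorized!(X; tol=sqrt(eps(real(eltype(X)))))
    X_norms = vec(sqrt.(sum(abs2, X; dims=1)))   # one kernel for all columns
    if any(n -> n < tol, X_norms)                # reduction on the GPU, returns a Bool
        for (j, n) in enumerate(Array(X_norms))  # GPU -> CPU transfer only when needed
            n < tol && randn!(view(X, :, j))
        end
    end
    X
end
```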
Surprisingly, this doesn't reduce the computational time. The norms get computed much faster, which is expected, but the tolerance check still takes a lot of time. When I used the `@time` macro on the line doing the tolerance check, I noticed something rather odd: some calls finish in about 25 µs while others take around 12 ms. Since `X_norms` is a relatively small vector (around 200 elements), I expected it to always run in around 25 µs, not 12 ms. Initially, I thought the anonymous function would get recompiled every time, but that doesn't explain why some calls are much faster than others. Moving the anonymous function out and giving it a name doesn't change anything. Why is there such a difference between calls? Is this a synchronisation issue?
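One way to probe the synchronisation hypothesis (a sketch, assuming the slow calls are merely waiting for previously queued kernels to finish): synchronise explicitly before timing the check, so that `@time` cannot be charged with earlier asynchronous GPU work.

```julia
using CUDA

X = CUDA.randn(Float32, 84751, 194)            # same shape as the matrix in this issue

X_norms = vec(sqrt.(sum(abs2, X; dims=1)))      # queued asynchronously on the GPU
CUDA.synchronize()                               # wait for all pending kernels
@time any(n -> n < 1f-6, X_norms)                # now times only the tolerance check
```

If the 12 ms disappears with the explicit synchronisation, the check itself is cheap and the long timings were just the tail of earlier kernels.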
The other performance issue I noticed is in the `fft!` function. Most of the time is spent doing `f_fourier .= view(f_real, kpt.mapping)`. Each call is actually rather fast (about 800 µs), but since we call the FFT thousands of times, it adds up quickly: for this example, doing the FFTs for the local and kinetic terms takes about 4 seconds (16% of the computational time), and these 4 seconds are spent almost exclusively in the `view` operation above. I think this is expected, as this view more or less comes down to scalar indexing (we only want to pick a few elements, based on their position in an array), so there is probably not much we can do on the GPU side. However, if someone has an idea to get rid of the view, it would be great.
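For anyone who wants to look at this line in isolation, here is a small self-contained reproduction of the gather pattern (the grid size and index distribution are made up, and a plain index vector stands in for `kpt.mapping`):

```julia
using CUDA

n_real    = 128 * 128 * 64                      # hypothetical real-space grid size
n_fourier = 84_751                              # roughly the number of rows of X above

f_real    = CUDA.rand(Float32, n_real)
mapping   = CuArray(rand(1:n_real, n_fourier))  # stand-in for kpt.mapping, on the GPU
f_fourier = CUDA.zeros(Float32, n_fourier)

# With the index array on the GPU, this broadcast runs as a single gather kernel;
# whether kpt.mapping actually lives on the GPU here may be worth checking.
CUDA.@time f_fourier .= view(f_real, mapping)
```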