GPU performance issues #794
Comments

- It is indeed dispatched to a specific kernel in CUDA (this one).
- The comment suggests the boundscheck is expensive, can you try putting …
- I tried that, but it didn't change anything.
- One thing I would first do is to use Nsight Systems to profile your application: https://cuda.juliagpu.org/stable/development/profiling/#NVIDIA-Nsight-Systems
- You likely need …
- Did you put the …
- Yeah, sorry, I wasn't being very explicit. I tried to put the …
- No actual work is being done by the …
- Yes, I used both but that didn't change the timers. I'm currently looking into the source code of TimerOutputs, because I'm starting to doubt the time it's measuring...
- Ah yes sorry, the bounds check is indeed done within the view: …
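To follow the Nsight Systems suggestion above, one option (a sketch; the exact behaviour depends on the CUDA.jl version, see the linked documentation) is to wrap the region of interest in `CUDA.@profile`:

```julia
using CUDA

# Depending on the CUDA.jl version, this either delimits the region for an external
# profiler such as Nsight Systems or runs CUDA.jl's integrated profiler and prints a
# summary of the kernels launched inside the block.
CUDA.@profile begin
    X = CUDA.randn(Float32, 84751, 194)
    sum(abs2, X; dims=1)     # stand-in for the real SCF / LOBPCG workload
end
```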
I've noted some performance issues when doing GPU computations with DFTK. I'm opening this issue to gather ideas, because so far I haven't managed to solve them (perhaps @vchuravy can help?).
I launched the SCF solver for a model with a supercell of parameters `(4, 5, 2)`. I am using `Float32`. On my computer, this takes about 22 seconds, and the timer for the LOBPCG routine shows that a large part of that time goes into `drop!`.

The current implementation of `drop!` is actually quite bad for GPUs: it launches a kernel for each column of `X`, so it is no surprise that a lot of time is spent in this loop. `drop!` happens very often (more than 130 times in this case); however, actually having to drop a column by randomising it is rather uncommon. `X` is a very "skinny" matrix: in this case, its size is around (84751, 194).
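A minimal sketch of the per-column pattern described above (an illustration only, not DFTK's actual `drop!`; the tolerance handling and the re-randomisation are assumptions):

```julia
using LinearAlgebra, Random

# Per-column drop!: on a CuArray, every norm(view(X, :, j)) is a separate reduction
# kernel launch, so the loop pays one launch per column even when no column ever
# needs to be dropped.
function drop_per_column!(X::AbstractMatrix{T}; tol=sqrt(eps(real(T)))) where {T}
    for j in axes(X, 2)
        if norm(view(X, :, j)) < tol   # one GPU kernel per column
            randn!(view(X, :, j))      # re-randomise the (near-)zero column
        end
    end
    X
end
```

On the CPU this loop is harmless; on the GPU the cost comes largely from the number of kernel launches.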
My idea was to vectorize the norm computation; however, doing this also forces us to bring the norms array back to the CPU, and we barely gain any computational time. Instead, I would like to also vectorize the tolerance check (using `any` or `mapreduce`), and only do the GPU -> CPU transfer if it is actually needed; my actual code follows this idea.
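A hedged sketch of what such a vectorized check can look like (the function name, `X_norms`, `tol`, and the fallback loop are assumptions, not the exact code from this issue):

```julia
using CUDA, Random

# Compute all column norms in one fused GPU reduction, do the tolerance check on the
# GPU, and only transfer to the CPU in the (rare) case that a column has to be dropped.
function drop_vectorized!(X; tol=sqrt(eps(real(eltype(X)))))
    X_norms = vec(sqrt.(sum(abs2, X; dims=1)))   # one kernel for all columns
    if any(n -> n < tol, X_norms)                # reduction on the GPU, returns a Bool
        for (j, n) in enumerate(Array(X_norms))  # GPU -> CPU transfer only when needed
            n < tol && randn!(view(X, :, j))
        end
    end
    X
end
```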
Surprisingly, this doesn't reduce the computational time. The norms get computed much faster, which is expected, but the tolerance check still takes a lot of time. When I used the `@time` macro on the line doing the tolerance check, I noticed something rather odd: some calls finish in about 25 µs while others take around 12 ms. Since `X_norms` is a relatively small vector (around 200 elements), I expected it to always run in around 25 µs, not 12 ms. Initially, I thought the anonymous function would get recompiled every time, but that doesn't explain why some calls are much faster than others. Moving the anonymous function out and giving it a name doesn't change anything. Why is there such a difference between calls? Is this a synchronisation issue?
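One way to probe the synchronisation hypothesis (a sketch, assuming the slow calls are merely waiting for previously queued kernels to finish): synchronise explicitly before timing the check, so that `@time` cannot be charged with earlier asynchronous GPU work.

```julia
using CUDA

X = CUDA.randn(Float32, 84751, 194)            # same shape as the matrix in this issue

X_norms = vec(sqrt.(sum(abs2, X; dims=1)))      # queued asynchronously on the GPU
CUDA.synchronize()                               # wait for all pending kernels
@time any(n -> n < 1f-6, X_norms)                # now times only the tolerance check
```

If the 12 ms disappears with the explicit synchronisation, the check itself is cheap and the long timings were just the tail of earlier kernels.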
The other performance issue I noticed is in the `fft!` function. Most of the time is spent doing `f_fourier .= view(f_real, kpt.mapping)`. Each call is actually rather fast (about 800 µs), but since we call the FFT thousands of times, it adds up quickly: for this example, doing the FFTs for the local and kinetic terms takes about 4 seconds (16% of the computational time), and these 4 seconds are spent almost exclusively in the `view` operation above. I think this is expected, as this view more or less comes down to scalar indexing (we only want to pick a few elements, based on their position in an array), so there is probably not much we can do on the GPU side. However, if someone has an idea to get rid of the view, it would be great.
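For anyone who wants to look at this line in isolation, here is a small self-contained reproduction of the gather pattern (the grid size and index distribution are made up, and a plain index vector stands in for `kpt.mapping`):

```julia
using CUDA

n_real    = 128 * 128 * 64                      # hypothetical real-space grid size
n_fourier = 84_751                              # roughly the number of rows of X above

f_real    = CUDA.rand(Float32, n_real)
mapping   = CuArray(rand(1:n_real, n_fourier))  # stand-in for kpt.mapping, on the GPU
f_fourier = CUDA.zeros(Float32, n_fourier)

# With the index array on the GPU, this broadcast runs as a single gather kernel;
# whether kpt.mapping actually lives on the GPU here may be worth checking.
CUDA.@time f_fourier .= view(f_real, mapping)
```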