Add GPU CI via gitlab #796

Merged
merged 16 commits into master from add-gitlab
Dec 12, 2022

Conversation

mfherbst
Member

@mfherbst mfherbst commented Dec 7, 2022

Set up with the help of @carstenbauer.

@GVigne I will merge the dummy test first and you can add more sophisticated testing later.
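
For reference, a minimal sketch of what such a dummy GPU test could look like (hypothetical, using CUDA.jl; the actual test in this PR may differ):

using Test, CUDA

@testset "GPU smoke test" begin
    # Check that the runner actually sees a working CUDA setup.
    @test CUDA.functional()

    # Trivial round trip: allocate on the GPU, compute, copy back.
    x = CUDA.rand(Float64, 16)
    @test Array(x .+ 1) ≈ Array(x) .+ 1
end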

TODOs:

  • Get coverage submission to work
  • The pipeline fails even though the test is successful. Any idea @carstenbauer?
  • Sometimes the jobs do not seem to get queued and the job stays in the submission stage forever, so it is not all that smooth yet ... (Update: seems to work better now)

@carstenbauer
Contributor

carstenbauer commented Dec 12, 2022

  • The pipeline fails even though the test is successful. Any idea @carstenbauer?

Maybe the last FAQ issue? In short, remove clear from your ~/.bash_logout.
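
For example, a hypothetical one-liner (check the file contents first, they may differ):

# Comment out the bare `clear` call; it breaks the non-interactive
# sessions the GitLab runner opens over SSH.
sed -i 's/^[[:space:]]*clear[[:space:]]*$/# clear/' ~/.bash_logout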

@carstenbauer
Contributor

carstenbauer commented Dec 12, 2022

┌ Warning: The JULIA_MPI_BINARY environment variable is no longer used to configure the MPI binary.
│ Please use the MPIPreferences.jl package instead:
│ 
│     MPIPreferences.use_system_binary()  # use the system binary
│     MPIPreferences.use_jll_binary()     # use JLL binary
│ 
│ See https://juliaparallel.org/MPI.jl/stable/configuration/ for more details
│ 
│   ENV["JULIA_MPI_BINARY"] = "system"
│   MPIPreferences.binary = "MPICH_jll"
└ @ MPI ~/.julia/packages/MPI/5cAQG/src/MPI.jl:90

This is strange and shouldn't appear. In a fresh shell on Noctua 2 I get no warning:

➜  bauerc@n2login3 ~  module load JuliaHPC/1.8.3-foss-2022a-CUDA-11.7.0

➜  bauerc@n2login3 ~  julia -q
(@v1.8) pkg> activate --temp
  Activating new project at `/tmp/jl_cXgyqb`

(jl_cXgyqb) pkg> add MPI
   Resolving package versions...
   Installed DocStringExtensions ─ v0.9.3
    Updating `/tmp/jl_cXgyqb/Project.toml`
  [da04e1cc] + MPI v0.20.5
    Updating `/tmp/jl_cXgyqb/Manifest.toml`
  [ffbed154] + DocStringExtensions v0.9.3
  [692b3bcd] + JLLWrappers v1.4.1
  [da04e1cc] + MPI v0.20.5
  [3da0fdf6] + MPIPreferences v0.1.7
  [21216c6a] + Preferences v1.3.0
  [ae029012] + Requires v1.3.0
  [7cb0a576] + MPICH_jll v4.0.2+5
  [f1f71cc9] + MPItrampoline_jll v5.0.2+1
  [9237b28f] + MicrosoftMPI_jll v10.1.3+2
  [fe0851c0] + OpenMPI_jll v4.1.3+3
  [0dad84c5] + ArgTools v1.1.1
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8ba89e20] + Distributed
  [f43a241f] + Downloads v1.6.0
  [7b1f6079] + FileWatching
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL v0.6.3
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions v1.2.0
  [44cfe95a] + Pkg v1.8.0
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [fa267f1f] + TOML v1.0.0
  [a4e569a6] + Tar v1.10.1
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v0.5.2+0
  [deac9b47] + LibCURL_jll v7.84.0+0
  [29816b5a] + LibSSH2_jll v1.10.2+0
  [c8ffd9c3] + MbedTLS_jll v2.28.0+0
  [14a3606d] + MozillaCACerts_jll v2022.2.1
  [83775a58] + Zlib_jll v1.2.12+3
  [8e850ede] + nghttp2_jll v1.48.0+0
  [3f19e933] + p7zip_jll v17.4.0+0

julia> using MPI

julia> MPI.Init()
MPI.ThreadLevel(2)

I suggest you try to run your tests on the cluster outside of CI to see if you get the same output. That is, use

srun -N 1 -n 1 -c 16 --gres=gpu:a100:1 -t 00:15:00 -A hpc-prf-dftkjl -p gpu --pty bash

to get an interactive session on a GPU node, then load the JuliaHPC module and run your ] test.

@mfherbst
Member Author

@carstenbauer Thanks for your help. To comment on your remarks:

Maybe the last FAQ issue? In short, remove clear from your ~/.bash_logout.

Seems to have worked.

I suggest you try to run your tests on the cluster outside of CI to see if you get the same output.

Yes, I get the same output, both with the tests I run in the GitLab runner and with a simple ] test as you suggest.

@mfherbst mfherbst mentioned this pull request Dec 12, 2022
@carstenbauer
Contributor

Seems to have worked.

Great.

Yes, I get the same output, both with the tests I run in the GitLab runner and with a simple ] test as you suggest.

Curious. I guess it doesn't really matter, since it only opts to use a JLL MPI instead of the system one, but I might debug this when I find the time. If you want to fix it yourself (locally), simply following the instructions, i.e. creating a local preferences TOML via MPIPreferences.jl, should work. But I thought I had worked this out globally...
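
For reference, that local fix would look something like this (a sketch following the printed instructions; run it in the project environment whose MPI binary you want to pin):

using MPIPreferences

# Detect the system MPI library and record the choice in a
# LocalPreferences.toml next to the active Project.toml, so that
# MPI.jl uses the system binary instead of the MPICH JLL.
MPIPreferences.use_system_binary()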

@carstenbauer
Contributor

BTW, coverage seems to have worked as well AFAICT.

@mfherbst
Member Author

If you want to fix it yourself (locally), simply following the instructions, i.e. creating a local preferences TOML via MPIPreferences.jl, should work.

Ok, I think for now we won't do MPI and GPU, so I'll just ignore it.

@mfherbst
Member Author

BTW, coverage seems to have worked as well AFAICT.

Sorry, I don't understand what you are referring to. Do you mean that it should basically work, or that it has already worked successfully on this setup?

@carstenbauer
Contributor

carstenbauer commented Dec 12, 2022

It, that is, the coverage.jl script, has worked already. If you look at the job output, at the very bottom, it says (28.793774319066145%) covered. That's what I meant.

UPDATE: I see you have commented out the submit part. But I'm optimistic this works as well.
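
For reference, the core of such a script typically looks like this with Coverage.jl (a sketch under assumptions; the actual coverage.jl in this PR may differ):

using Coverage

# Collect the *.cov files produced by `julia --code-coverage` under src/.
coverage = process_folder()

# Print the summary line seen at the bottom of the job output.
covered, total = get_summary(coverage)
println("($(100 * covered / total)%) covered")

# The submission step (the part commented out in the PR):
# Codecov.submit_local(coverage)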

@mfherbst mfherbst merged commit 2b45e5b into master Dec 12, 2022
@mfherbst mfherbst deleted the add-gitlab branch December 12, 2022 13:46
@carstenbauer
Contributor

carstenbauer commented Dec 12, 2022

FYI, I think this is the correct badge for the CI: https://git.uni-paderborn.de/herbstm/DFTK.jl/badges/master/pipeline.svg?key_text=CI@PC2

