MPI issues #148

Closed
johnomotani opened this issue Nov 19, 2023 · 0 comments · Fixed by #150
Labels
bug Something isn't working

johnomotani commented Nov 19, 2023

We get MPI errors if we run in parallel with the latest version of all the dependencies (as of 19/11/2023).

I think the problem is a slightly complicated mix-up that will probably be fixed in the Julia packages relatively soon:

  • The HDF5_jll.jl package (which bundles a version of HDF5 with Julia) has started supporting MPI (I think only if MPI.jl is being used, but that is always the case for us). However, it always links to the Julia-installed MPI, not the 'system-provided' one, which is what we usually use. Linking to two different MPI libraries causes errors. This is a bug in the Julia packages (not in moment_kinetics); see JuliaIO/HDF5.jl#1079 ("Crash with system-provided OpenMPI and HDF5_jll v1.14") and JuliaPackaging/Yggdrasil#6893 ("Incorrect MPI augmentation if binary == "system"").
  • For HDF5.jl the bug isn't actually a problem, because we tell HDF5 to use a 'system-provided' HDF5 library, which is (at least should be!) linked with the correct MPI library. This means that HDF5.jl doesn't actually use HDF5_jll.jl, so we wouldn't be affected by the bug, except that...
  • The NetCDF wrapper NCDatasets.jl doesn't support MPI, and does link to the HDF5 provided by HDF5_jll.jl. Previously this wasn't (or at least didn't seem to be) a problem - apparently linking two versions of libhdf5.so that are used in different places is OK - but now that HDF5_jll.jl links to MPI, NCDatasets.jl's copy of HDF5 pulls in a second MPI library, which causes errors.

Possible workarounds:

  1. Wait for the Julia packages to be fixed, then the problem should go away.
  2. Pin HDF5_jll to a slightly older, working version (i.e. version 1.12.x) at least until the Julia packages are fixed.
  3. Get rid of the NetCDF file I/O.
  4. Tell NCDatasets to use a system-provided libnetcdf.so, so it doesn't link to the HDF5_jll.jl version of HDF5. On systems where we have to compile HDF5 for ourselves, this would be annoying as we would have to compile NetCDF as well, and link it to the local version of HDF5.

I think option 2 is the best and easiest solution while we wait for a fix to JuliaPackaging/Yggdrasil#6893. I think it's possible to tell Julia to pin a package to a certain version, so that not everyone has to do it by hand. Even if we did it 'by hand', the CI jobs would have to do the same thing, which would probably be more work than pinning the package. I'll try to make a PR...

@johnomotani johnomotani added the bug Something isn't working label Nov 19, 2023
johnomotani added a commit that referenced this issue Nov 19, 2023
Version 1.14 of the HDF5_jll package causes a bug (see #148), so set the
`[compat]` section of Project.toml to exclude this version (hoping the
bug will be fixed by version 1.15 at the latest, some future patch
release like 1.14.3 might also be OK though).

HDF5_jll is not used directly in moment_kinetics, and so is only
included in the Project.toml in order to apply this version restriction.
It should probably be removed once the latest version has fixed the bug.
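
For illustration, a restriction like the one described in the commit message above might look roughly like the fragment below. This is a sketch that assumes Pkg's standard version-specifier syntax; the exact specifier used in the fixing PR is not quoted in this thread, so the bound shown here is an assumption and not a copy of the real Project.toml.

```toml
# Project.toml (fragment, illustrative only)
# HDF5_jll also has to be listed under [deps] for a [compat] entry to be
# accepted; running `Pkg.add("HDF5_jll")` adds that entry with the right UUID.

[compat]
# Assumed specifier: allow anything below 1.14, or 1.15 and later, which
# skips the whole 1.14.x series. Comma-separated specifiers form a union.
HDF5_jll = "<1.14, >=1.15"
```

A stricter alternative would be `HDF5_jll = "~1.12"`, which only permits 1.12.x and matches workaround 2 more literally, at the cost of needing another edit once a fixed release is available.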