Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distributed capabilities #1133

Merged
merged 204 commits into from
Oct 31, 2022
Merged

Add distributed capabilities #1133

merged 204 commits into from
Oct 31, 2022

Conversation

MarcelKoch
Copy link
Member

@MarcelKoch MarcelKoch commented Oct 5, 2022

This PR will add basic, distributed data structures (matrix and vector), and enable some solvers for these types. This PR contains the following PRs:

Changes

  • moves new types into experimental namespace
  • makes generic_scoped_device_id_guard destructor noexcept by terminating if restoring the original device id fails
  • switches to blocking communication in the SpMV if OpenMPI version 4.0.x is used
  • disables Horeka mpi tests and uses nla-gpu instead

@MarcelKoch MarcelKoch added the 1:ST:ready-for-review This PR is ready for review label Oct 5, 2022
@MarcelKoch MarcelKoch added this to the Ginkgo 1.5.0 milestone Oct 5, 2022
@MarcelKoch MarcelKoch self-assigned this Oct 5, 2022
@MarcelKoch MarcelKoch requested a review from a team October 5, 2022 07:45
@ginkgo-bot ginkgo-bot added mod:all This touches all Ginkgo modules. reg:benchmarking This is related to benchmarking. reg:build This is related to the build system. reg:documentation This is related to documentation. reg:example This is related to the examples. reg:helper-scripts This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo. reg:testing This is related to testing. type:matrix-format This is related to the Matrix formats type:multigrid This is related to multigrid type:preconditioner This is related to the preconditioners type:reordering This is related to the matrix(LinOp) reordering type:solver This is related to the solvers type:stopping-criteria This is related to the stopping criteria labels Oct 5, 2022
@MarcelKoch
Copy link
Member Author

format!

@MarcelKoch MarcelKoch force-pushed the distributed-develop branch 2 times, most recently from f72dd76 to 699d4da Compare October 11, 2022 07:35
@yhmtsai
Copy link
Member

yhmtsai commented Oct 18, 2022

Do all commits belong to this pull request?

@MarcelKoch
Copy link
Member Author

@yhmtsai I would say only the recent ones that were not covered in the earlier PRs need to be reviewed. So only the commits after the last merge commit.

@yhmtsai
Copy link
Member

yhmtsai commented Oct 19, 2022

@MarcelKoch Due to too long commit history and repeated commit from earlier pr, it will be squashed commit when we merge it.
To keep these commits, could you create a backup branch and rebase this branch with dropping those merged commits?
In the end, copying all files from backup to this branch to ensure there's no missing change during rebasing.
you can still keep this branch as a backup and rebase in another branch.

Sorry, I misunderstand the messages.
Those commits are from another pull request into this branch not the develop branch, so we should keep the commits

CMakeLists.txt Show resolved Hide resolved
common/unified/solver/cg_kernels.cpp Show resolved Hide resolved
core/distributed/matrix.cpp Show resolved Hide resolved
Comment on lines +51 to +54
env({{"MV2_COMM_WORLD_LOCAL_RANK", ""},
{"OMPI_COMM_WORLD_LOCAL_RANK", ""},
{"MPI_LOCALRANKID", ""},
{"SLURM_LOCALID", ""}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does changing the environment after MPI_Init during the program has effect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might. In each test I set the env variable explicitly before I call the mapping function. So if the env var is changed in between that then the test might fail. But I think for our containers that shouldn't be an issue.
Also, I reset the environment to what it was at the time SetUp is called, so it won't reset to the changes made in between.

core/test/mpi/gtest/mpi_listener.cpp Show resolved Hide resolved
include/ginkgo/config.hpp.in Show resolved Hide resolved
* @param recv_type the MPI_Datatype for the recv buffer
* @param comm the communicator
*/
void all_to_all_v(std::shared_ptr<const Executor> exec,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be like the above version using SendType and RecvType?
also scaled the offset by the type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on in which way this should be like the templated one? This one is needed for the custom MPI_Datatype to handle multiple right-hand-sides.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MPI_Datatype is 8 bytes, will send_buffer+1 or send_buffer+2 (or the send_offsets 1 or 2) to access the send[1]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is an MPI detail that we don't need to care about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do because this function has the offset here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have two version for that

  1. with the templated type -> it will add the MPItype and then pass to the following option
  2. with the void type

From my view on this interface, I will pass the offset like (1, 2, 3) no matter what type is because it takes care of the type. However, I will consider to pass the offset like (2, 4, 6) when the type size is 8 bytes for the void version because I do not know whether the function will take care of the type size for the offset.

After reading some documentation and the implementation of 1, it seems to take care of the type size. The alltoallw is another version which only consider the offset in bytes.

It does not need to add the template on the void version but could you add some short documentation in void version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How the offset is interpreted depends on the MPI_Datatype, I think you can read up on that somewhere in the MPI standard. So the offset is not in byte, but something depending on the datatype.
All of these functions are only wrapper for the corresponding MPI functions, so I don't think we should add documentation here. The only useful documentation I could think of is referring to the MPI standard, but that seems superfluous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, the documentation in MPI is not clear to me. At least from MPICH, they only specify MPI_Alltoallw is in byte, but not mention specific in MPI_Alltoallv. Without looking into MPI_alltoallw, I will not think MPI_alltoallv might using element-wise.
Okay, I found more clear document in openmpi and rookiehpc not like MPICH. they specify it is in units of sendtype
I mean in the param doc not the function description.
something like the element-wise offsets for the send buffer or the offsets (in units of send_type) for the send buffer for send_offsets in void version
If you think it is still confusing or might not match the mpi document, you can keep the original version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't explain the MPI functions here, and I don't think it would be our job to do so. One option would be to remove the parameter description on all of our MPI wrapper and just keep the reference to the MPI standard. Would you be on board with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already elaborated a bit on that on slack, but I think this function should at the very least have a different name, preferably be more specialized to allow making it type-safe instead of void*-typed and resolving the whole offset question.

include/ginkgo/core/distributed/vector.hpp Outdated Show resolved Hide resolved
test/mpi/distributed/matrix.cpp Outdated Show resolved Hide resolved
test/utils/mpi/executor.hpp Show resolved Hide resolved
@MarcelKoch MarcelKoch added the 1:ST:high-importance This issue/PR is of high importance and must be addressed as soon as possible. label Oct 20, 2022
@tcojean tcojean requested a review from yhmtsai October 20, 2022 15:13
MarcelKoch and others added 19 commits October 31, 2022 09:21
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
this is necessary (at least for the stopping status) because in MPI runs, some processes may have zero local dofs and thus the initialization would be skipped
On two different systems using openmpi 4.0.x results in deadlocks in our distributed solver test. For versions 4.1.[34] the deadlock disappears, and intel mpi and mvapich2 also don't show a deadlock.
- documentation
- set random engine seed explicitly

Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
- disables test with UNIX calls on windows
- removes wrong documentation

Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Pratik Nayak <pratik.nayak@kit.edu>
- explicitly request MPI version 3.1

Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
.gitlab-ci.yml Outdated Show resolved Hide resolved
@ginkgo-bot
Copy link
Member

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 1266 Removed, 633 Changed (42119 filtered out), 2596 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

@sonarcloud
Copy link

sonarcloud bot commented Oct 31, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 255 Code Smells

81.6% 81.6% Coverage
5.8% 5.8% Duplication

@codecov
Copy link

codecov bot commented Oct 31, 2022

Codecov Report

Base: 90.87% // Head: 91.98% // Increases project coverage by +1.10% 🎉

Coverage data is based on head (b59a9dd) compared to base (cff9742).
Patch coverage: 95.73% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1133      +/-   ##
===========================================
+ Coverage    90.87%   91.98%   +1.10%     
===========================================
  Files          508      535      +27     
  Lines        44294    46228    +1934     
===========================================
+ Hits         40254    42524    +2270     
+ Misses        4040     3704     -336     
Impacted Files Coverage Δ
common/unified/distributed/partition_kernels.cpp 100.00% <ø> (ø)
core/device_hooks/common_kernels.inc.cpp 0.00% <0.00%> (ø)
core/device_hooks/cuda_hooks.cpp 51.28% <0.00%> (-7.55%) ⬇️
core/device_hooks/dpcpp_hooks.cpp 23.68% <0.00%> (-3.59%) ⬇️
core/device_hooks/hip_hooks.cpp 51.28% <0.00%> (-7.55%) ⬇️
core/distributed/partition.cpp 97.82% <ø> (ø)
core/matrix/dense.cpp 93.78% <ø> (+0.43%) ⬆️
core/solver/bicg.cpp 82.35% <ø> (ø)
core/test/base/abstract_factory.cpp 100.00% <ø> (ø)
core/test/base/mtx_io.cpp 98.86% <ø> (ø)
... and 123 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@MarcelKoch MarcelKoch merged commit c1f8bd4 into develop Oct 31, 2022
@MarcelKoch MarcelKoch deleted the distributed-develop branch October 31, 2022 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:high-importance This issue/PR is of high importance and must be addressed as soon as possible. 1:ST:ready-to-merge This PR is ready to merge. mod:all This touches all Ginkgo modules. reg:benchmarking This is related to benchmarking. reg:build This is related to the build system. reg:documentation This is related to documentation. reg:example This is related to the examples. reg:helper-scripts This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo. reg:testing This is related to testing. type:distributed-functionality type:matrix-format This is related to the Matrix formats type:multigrid This is related to multigrid type:preconditioner This is related to the preconditioners type:reordering This is related to the matrix(LinOp) reordering type:solver This is related to the solvers type:stopping-criteria This is related to the stopping criteria
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants