Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: RCP Thread-safety issues with parallel host backends without setting Teuchos_ENABLE_THREAD_SAFE=ON #11921

Closed
ndellingwood opened this issue May 25, 2023 · 21 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: Teuchos Issues primarily dealing with the Teuchos Package pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@ndellingwood
Copy link
Contributor

Bug Report

@trilinos/tpetra

This issue is to relay info from kokkos/kokkos#6082 regarding some RCP thread-safety issues encountered testing Tpetra with Kokkos@develop branch with the OpenMP backend (a couple different testing configurations are posted in the issue)

The problem seems to occur with kernels that capture RCPs, creating multiple copies within OpenMP parallel regions which is not thread-safe without setting Teuchos_ENABLE_THREAD_SAFE=ON , e.g.

Kokkos::parallel_scan
("Tpetra::CrsGraph::packFillActiveNew: Pack exports",
inputRange, [=, &prefix]
(const LO i, size_t& exportsOffset, const bool final) {
const size_t curOffset = exportsOffset;
const LO lclRow = exportLIDs_h(i);
const GO gblRow = rowMap.getGlobalElement (lclRow);
if (gblRow == Details::OrdinalTraits<GO>::invalid ()) {
if (verbose) {
std::ostringstream os;
os << *prefix << "For i=" << i << ", lclRow=" << lclRow
<< " not in row Map on this process" << endl;
std::cerr << os.str();
}
Kokkos::atomic_add (&errCountView(), ONE);
return;
}
const RowInfo rowInfo = this->getRowInfoFromGlobalRowIndex (gblRow);
if (rowInfo.localRow == Details::OrdinalTraits<size_t>::invalid ()) {
if (verbose) {
std::ostringstream os;
os << *prefix << "For i=" << i << ", lclRow=" << lclRow
<< ", gblRow=" << gblRow << ": invalid rowInfo"
<< endl;
std::cerr << os.str();
}
Kokkos::atomic_add (&errCountView(), ONE);
return;
}
if (curOffset + rowInfo.numEntries > totalNumPackets) {
if (verbose) {
std::ostringstream os;
os << *prefix << "For i=" << i << ", lclRow=" << lclRow
<< ", gblRow=" << gblRow << ", curOffset (= "
<< curOffset << ") + numEnt (= " << rowInfo.numEntries
<< ") > totalNumPackets (= " << totalNumPackets
<< ")." << endl;
std::cerr << os.str();
}
Kokkos::atomic_add (&errCountView(), ONE);
return;
}
const LO numEnt = static_cast<LO> (rowInfo.numEntries);
if (this->isLocallyIndexed ()) {
auto lclColInds = getLocalIndsViewHost(rowInfo);
if (final) {
for (LO k = 0; k < numEnt; ++k) {
const LO lclColInd = lclColInds(k);
const GO gblColInd = colMapPtr->getGlobalElement (lclColInd);
// Pack it, even if it's wrong. Let the receiving
// process deal with it. Otherwise, we'll miss out
// on any correct data.
exports_h(curOffset + k) = gblColInd;
} // for each entry in the row
} // final pass?
exportsOffset = curOffset + numEnt;
}
else if (this->isGloballyIndexed ()) {
auto gblColInds = getGlobalIndsViewHost(rowInfo);
if (final) {
for (LO k = 0; k < numEnt; ++k) {
const GO gblColInd = gblColInds(k);
// Pack it, even if it's wrong. Let the receiving
// process deal with it. Otherwise, we'll miss out
// on any correct data.
exports_h(curOffset + k) = gblColInd;
} // for each entry in the row
} // final pass?
exportsOffset = curOffset + numEnt;
}
// If neither globally nor locally indexed, then the graph
// has no entries in this row (or indeed, in any row on this
// process) to pack.
});

A couple potential options to resolve this could be to set Teuchos_ENABLE_THREAD_SAFE=ON as default, or make sure no RCPs are passed to Kokkos' parallel_* kernels, or ?

Adding @masterleinad who helped triage this

@ndellingwood ndellingwood added type: bug The primary issue is a bug in Trilinos code or tests pkg: Tpetra labels May 25, 2023
@jhux2 jhux2 added the pkg: Teuchos Issues primarily dealing with the Teuchos Package label May 26, 2023
@jhux2
Copy link
Member

jhux2 commented May 26, 2023

@trilinos/teuchos

@bartlettroscoe
Copy link
Member

One would have to do some profiling to see the impact of turning on thread safety for RCP, especially in these low-level routines. Otherwise, I don't have a strong opinion one way or the other.

@masterleinad
Copy link
Contributor

Related to #11931.

@nmm0
Copy link

nmm0 commented May 31, 2023

Even if Teuchos_ENABLE_THREAD_SAFE=ON has too much of a performance implication if enabled globally, it's still necessary to make sure that anything accessed in a parallel kernel is thread safe. So either in this case we need to make sure the RCP is not passed in to the kernel (so the refcount won't race) or to correctly guard RCP in this instance with some sort of lock.

On the Kokkos side, we can't guarantee that thread-unsafe code works.

@csiefer2
Copy link
Member

csiefer2 commented Jun 5, 2023

Would a configure-time check in Tpetra requiring Teuchos_ENABLE_THREAD_SAFE=ON if you're using a host-threaded backend work for people? @masterleinad @ndellingwood @bartlettroscoe

If someone can think of a code-based way to ensure no RCP capture by kernels, that could work too.

@masterleinad
Copy link
Contributor

Would a configure-time check in Tpetra requiring Teuchos_ENABLE_THREAD_SAFE=ON if you're using a host-threaded backend work for people?

Sounds reasonable if it doesn't break users' configurations.

@crtrott
Copy link
Member

crtrott commented Jun 5, 2023

We do need a decision on this: we do not promise not to make a copy of the functor per thread inside backends, and I think it would be a serious implementation constraint issue if we did.

@csiefer2
Copy link
Member

csiefer2 commented Jun 5, 2023

As per discussions with @crtrott the most reasonable path forward in the short term is:

Have Teuchos check to see if a threaded Kokkos backend is enabled and require Teuchos_ENABLE_THREAD_SAFE=ON in that case.

Longer term we should consider making Teuchos_ENABLE_THREAD_SAFE=ON the default and make the user turn if off manually if they're sure they don't need it.

@bartlettroscoe is the short term plan OK?

@bartlettroscoe
Copy link
Member

@bartlettroscoe is the short term plan OK?

@csiefer2, that sounds reasonable to me.

I just wish Trilinos had a comprehensive and robust performance test suite so we could see the performance impact this will have (before we hear it from users if there is an issue with some algorithms). But note that I suspect that a good bit of this performance critical software could be refactored to use semi-persisting associations (using Teuchos::Ptr<T> instead of Teuchos::RCP<T>) and avoid passing RCPs into such code (see section "5.12.3 Performance tuning strategies, semi-persisting associations" in here).

@csiefer2
Copy link
Member

csiefer2 commented Jun 6, 2023

@bartlettroscoe I can do a PR and mark you as a reviewer.

Our current performance suite is really just designed to spot app-specific regressions, rather than absolutely everything. So, in theory, if this effects apps which are tested we should see performance differences there. You know, in theory ;)

@bartlettroscoe
Copy link
Member

@bartlettroscoe I can do a PR and mark you as a reviewer.

👍

bartlettroscoe added a commit that referenced this issue Jun 7, 2023
The new test
`TrilinosInstallTests_simpleBuildAgainstTrilinos_by_package_build_tree` merged
from PR #11863 fails because the subdirs ${CMAKE_CURRENT_BINARY_DIR}/common
and ${CMAKE_CURRENT_SOURCE_DIR}/common because this CMakeLists.txt file
already sits in the kokkos-kernels/common/ subdir.

I don't know why this error did not happen with PR testing for PR #11863 but
this is clearly the right thing to do.
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2023
…s:develop' (ab899a0).

* trilinos-develop: (22 commits)
  Remove non-existant subdir kokkos-kernels/common/common (trilinos#11921, trilinos#11863)
  Teuchos: Fixing cmake logic
  Teuchos: Fixing catch() issues with C++ language drift
  TrilinosSS: include <omp.h> (Fix trilinos#11867)
  MueLu hierarchical: Fix build error
  Tpetra: Changes to StaticView for Kokkos PTHREAD to THREADS change
  Teuchos: Automatically enabling Tecuhos_ENABLE_THREAD_SAFE if you have Kokkos THREADS or OPENMP for the host
  Stokhos:  Add missing KOKKOS_INLINE_FUNCTION to fix build errors on HIP
  Phalanx: Remove usage of undefined var Kokkos_INCLUDE_DIRS (trilinos#11545)
  Kokkos: Mark HWLOC as a TriBITS TPL as well (trilinos#11938)
  Update for removal of Kokkos subpackages and Kokkos test renamings (trilinos#11545, trilinos#11808)
  KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos#11545)
  Add test simpleBuildAgainstTrilinos_by_package_build_tree_name (trilinos#11545)
  Pass in and define compilers before calling find_package(Trilinos) (trilinos#11545)
  Add `Kokkos::all_libs` alias target for compatibility with TriBITS/Trilinos (trilinos#6157)
  Export Kokkos_ENABLE_<OPTION> that are relevant
  Do not append to Kokkos_OPTIONS variables those in the do not export list
  Expand list of kokkos options not to export with cmake
  Tpetra: Don't use std::binary_function
  Tpetra: Fixing missing HIP tesT
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2023
…s:develop' (ab899a0).

* trilinos-develop: (22 commits)
  Remove non-existant subdir kokkos-kernels/common/common (trilinos#11921, trilinos#11863)
  Teuchos: Fixing cmake logic
  Teuchos: Fixing catch() issues with C++ language drift
  TrilinosSS: include <omp.h> (Fix trilinos#11867)
  MueLu hierarchical: Fix build error
  Tpetra: Changes to StaticView for Kokkos PTHREAD to THREADS change
  Teuchos: Automatically enabling Tecuhos_ENABLE_THREAD_SAFE if you have Kokkos THREADS or OPENMP for the host
  Stokhos:  Add missing KOKKOS_INLINE_FUNCTION to fix build errors on HIP
  Phalanx: Remove usage of undefined var Kokkos_INCLUDE_DIRS (trilinos#11545)
  Kokkos: Mark HWLOC as a TriBITS TPL as well (trilinos#11938)
  Update for removal of Kokkos subpackages and Kokkos test renamings (trilinos#11545, trilinos#11808)
  KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos#11545)
  Add test simpleBuildAgainstTrilinos_by_package_build_tree_name (trilinos#11545)
  Pass in and define compilers before calling find_package(Trilinos) (trilinos#11545)
  Add `Kokkos::all_libs` alias target for compatibility with TriBITS/Trilinos (trilinos#6157)
  Export Kokkos_ENABLE_<OPTION> that are relevant
  Do not append to Kokkos_OPTIONS variables those in the do not export list
  Expand list of kokkos options not to export with cmake
  Tpetra: Don't use std::binary_function
  Tpetra: Fixing missing HIP tesT
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2023
…s:develop' (ab899a0).

* trilinos-develop: (22 commits)
  Remove non-existant subdir kokkos-kernels/common/common (trilinos#11921, trilinos#11863)
  Teuchos: Fixing cmake logic
  Teuchos: Fixing catch() issues with C++ language drift
  TrilinosSS: include <omp.h> (Fix trilinos#11867)
  MueLu hierarchical: Fix build error
  Tpetra: Changes to StaticView for Kokkos PTHREAD to THREADS change
  Teuchos: Automatically enabling Tecuhos_ENABLE_THREAD_SAFE if you have Kokkos THREADS or OPENMP for the host
  Stokhos:  Add missing KOKKOS_INLINE_FUNCTION to fix build errors on HIP
  Phalanx: Remove usage of undefined var Kokkos_INCLUDE_DIRS (trilinos#11545)
  Kokkos: Mark HWLOC as a TriBITS TPL as well (trilinos#11938)
  Update for removal of Kokkos subpackages and Kokkos test renamings (trilinos#11545, trilinos#11808)
  KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos#11545)
  Add test simpleBuildAgainstTrilinos_by_package_build_tree_name (trilinos#11545)
  Pass in and define compilers before calling find_package(Trilinos) (trilinos#11545)
  Add `Kokkos::all_libs` alias target for compatibility with TriBITS/Trilinos (trilinos#6157)
  Export Kokkos_ENABLE_<OPTION> that are relevant
  Do not append to Kokkos_OPTIONS variables those in the do not export list
  Expand list of kokkos options not to export with cmake
  Tpetra: Don't use std::binary_function
  Tpetra: Fixing missing HIP tesT
  ...
@csiefer2
Copy link
Member

csiefer2 commented Jun 7, 2023

@ndellingwood My code change to Teuchos merged. Does this meet your needs?

@ndellingwood
Copy link
Contributor Author

@csiefer2 mostly, thanks for working on this! There were some changes to kokkos-kernels that need a matching PR to the kokkos-kernels repo (to avoid clobber with the next release), once that goes in we should be in good shape

I enabled Teuchos_ENABLE_THREAD_SAFE=ON in our nightly OpenMP builds to verify the setting would resolve the failures in kokkos/kokkos#6082, so these changes should take care of it. Thanks!

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2023
…s:develop' (ab899a0).

* trilinos-develop: (23 commits)
  Remove non-existant subdir kokkos-kernels/common/common (trilinos#11921, trilinos#11863)
  Teuchos: Fixing cmake logic
  Teuchos: Fixing catch() issues with C++ language drift
  fastilu: Fix memory leak.
  TrilinosSS: include <omp.h> (Fix trilinos#11867)
  MueLu hierarchical: Fix build error
  Tpetra: Changes to StaticView for Kokkos PTHREAD to THREADS change
  Teuchos: Automatically enabling Tecuhos_ENABLE_THREAD_SAFE if you have Kokkos THREADS or OPENMP for the host
  Stokhos:  Add missing KOKKOS_INLINE_FUNCTION to fix build errors on HIP
  Phalanx: Remove usage of undefined var Kokkos_INCLUDE_DIRS (trilinos#11545)
  Kokkos: Mark HWLOC as a TriBITS TPL as well (trilinos#11938)
  Update for removal of Kokkos subpackages and Kokkos test renamings (trilinos#11545, trilinos#11808)
  KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos#11545)
  Add test simpleBuildAgainstTrilinos_by_package_build_tree_name (trilinos#11545)
  Pass in and define compilers before calling find_package(Trilinos) (trilinos#11545)
  Add `Kokkos::all_libs` alias target for compatibility with TriBITS/Trilinos (trilinos#6157)
  Export Kokkos_ENABLE_<OPTION> that are relevant
  Do not append to Kokkos_OPTIONS variables those in the do not export list
  Expand list of kokkos options not to export with cmake
  Tpetra: Don't use std::binary_function
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Jun 7, 2023
…s:develop' (ab899a0).

* trilinos-develop: (23 commits)
  Remove non-existant subdir kokkos-kernels/common/common (trilinos#11921, trilinos#11863)
  Teuchos: Fixing cmake logic
  Teuchos: Fixing catch() issues with C++ language drift
  fastilu: Fix memory leak.
  TrilinosSS: include <omp.h> (Fix trilinos#11867)
  MueLu hierarchical: Fix build error
  Tpetra: Changes to StaticView for Kokkos PTHREAD to THREADS change
  Teuchos: Automatically enabling Tecuhos_ENABLE_THREAD_SAFE if you have Kokkos THREADS or OPENMP for the host
  Stokhos:  Add missing KOKKOS_INLINE_FUNCTION to fix build errors on HIP
  Phalanx: Remove usage of undefined var Kokkos_INCLUDE_DIRS (trilinos#11545)
  Kokkos: Mark HWLOC as a TriBITS TPL as well (trilinos#11938)
  Update for removal of Kokkos subpackages and Kokkos test renamings (trilinos#11545, trilinos#11808)
  KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos#11545)
  Add test simpleBuildAgainstTrilinos_by_package_build_tree_name (trilinos#11545)
  Pass in and define compilers before calling find_package(Trilinos) (trilinos#11545)
  Add `Kokkos::all_libs` alias target for compatibility with TriBITS/Trilinos (trilinos#6157)
  Export Kokkos_ENABLE_<OPTION> that are relevant
  Do not append to Kokkos_OPTIONS variables those in the do not export list
  Expand list of kokkos options not to export with cmake
  Tpetra: Don't use std::binary_function
  ...
@ndellingwood
Copy link
Contributor Author

Matching patch with the changes from 21e643d to kokkos-kernels is up kokkos/kokkos-kernels#1854

@bartlettroscoe
Copy link
Member

@csiefer2 mostly, thanks for working on this! There were some changes to kokkos-kernels that need a matching PR to the kokkos-kernels repo (to avoid clobber with the next release), once that goes in we should be in good shape

@ndellingwood, sorry I did not follow up on that right after I made the change and thanks for catching this. (Not sure why this was not caught in the testing of PR #11863).

@ndellingwood
Copy link
Contributor Author

@bartlettroscoe no worries, thanks for the quick fix with 21e643d 👍

@ndellingwood
Copy link
Contributor Author

kokkos/kokkos-kernels#1854 is merged
In an OpenMP nightly build (gcc/8.3 on Power9) with Kokkos Core+Kernels develop branches there was a performance test failure with TeuchosCore_RCP_PerformanceTests_basic_MPI_1 following merge of this PR. The build does not run unit tests with parallelism (i.e. ctest -j1). If the test is enabled by PR testing should it be disabled to avoid spurious failures? @bartlettroscoe

@bartlettroscoe
Copy link
Member

In an OpenMP nightly build (gcc/8.3 on Power9) with Kokkos Core+Kernels develop branches there was a performance test failure with TeuchosCore_RCP_PerformanceTests_basic_MPI_1 following merge of this PR.

@ndellingwood, can you run make dashboard and post this to CDash?

@ndellingwood
Copy link
Contributor Author

@ndellingwood, can you run make dashboard and post this to CDash?

@bartlettroscoe I don't have the nightly build configured to post to CDash, it's on my list of TODOs. Is it helpful if I point you to the Jenkins build for now? https://jenkins-son.sandia.gov/job/KokkosEco_Trilinos_Weaver_Gcc830_OpenMP_opt/251/consoleFull

Copy link

github-actions bot commented Jun 8, 2024

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 8, 2024
Copy link

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 10, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: Teuchos Issues primarily dealing with the Teuchos Package pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests
Projects
Status: Done
Development

No branches or pull requests

7 participants