-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TriBITS: Nalu-Wind Cuda builds won't configure #10954
Comments
Is it possible to reproduce the problem outside of Spack? |
It's very difficult to setup everything correctly to get a build outside of spack-manager. Inside of the spack-manager framework, we tried with Trilinos hashes before and after #10614, and the Nalu-Wind build succeeded with the old hash and failed with the new hash. Also, the last night that the Nalu-Wind Cuda nightlies configured correctly on ascicgpu was July 17/18, right before #10614 merged. If needed, I can develop a set of scripts to reproduce this without using spack-manager (although you'll still need spack for TPLs below Trilinos), but it will take a little while. Let me know though, because if it's the only path forward I'll definitely do it. |
@tasmith4, that is what I figured. But can you at least run a reconfigure after Spack loads the build env and puts you in the build dir? There is no mention at all of CMake in: What is your experience with this, just so I have the right expectations? |
So spack will automatically run cmake as part of the install step. When you have Trilinos setup as a develop spec in spack, you can go to the Trilinos build directory and blow away the CMake* files as you might in any other project, which will force spack to do the reconfigure. |
@tasmith4, how do you force Spack to do just the configure? What are the exact commands to use in this workflow? |
Do you want to reconfigure nalu-wind, trilinos, or both? |
Likely both. Depends if the errors are present when building only Nalu or if you can you reproduce them by enabling and building the Trilinos tests as well? I am guessing that I will need to be on an ascicgpu machine to reproduce this? |
We can't reproduce by enabling Trilinos tests (specifically, we did not see the issue in a build that had STK tests turned on). I think in this workflow you need to run It is likely you'll need to be on an ascicgpu machine, do you have access to one? |
Would I be blazing new territory here with Spack usage at SNL? Has no one tried to do micro-level CMake configuration and building with Trilinos through Spack?
I have access to some. Just checking that is the correct platform to try. |
Probably? Typically, we've done that by editing the trilinos package at https://github.com/psakievich/spack-manager/blob/main/repos/exawind/packages/trilinos/package.py. But I don't know anyone who's only been focused on the configure step, usually it's a means to getting the right build so we always want the build & install. @psakievich might have tried something else or know others who have. |
@bartlettroscoe you can dive into the build environment and tweak things like you are in a normal development with the Spack is essentially just embodying the generation of a CMake script through it's package+the configure phase so you can also edit the |
@psakievich and @tasmith4, what exact version of the Spack repo are you using and what are the exact arguments you are passing to Spack? Is it just |
@bartlettroscoe we have spack submoduled in spack-manager. We stay pretty close to the edge of the develop branch. But an exact reproducer would be: # setup spack-manager
git clone --recursive https://github.com/psakievich/spack-manager
export SPACK_MANAGER=$(pwd)/spack-manager
source ${SPACK_MANAGER}/start.sh
# create the spack environment
quick-create-dev --name bug --spec nalu-wind@master+cuda cuda_arch=70 trilinos@develop
# build
spack install
# modify configure (takes you to the build directory etc.)
build-env-dive nalu-wind Edit: If you would like any help with this feel free to schedule a meeting with me. |
@psakievich, I will give this a try and get back to you. It will give me a chance to kick the tires on Spack support for development and provide some feedback. |
@bartlettroscoe When I want to stop spack after a configure - I break it. This isn't ideal, because Spack sets various ENVs - most relevant is CMAKE_PREFIX (forget the name) - that talls CMake where to find cmake modules. Still, for most of my stuff, Trilinos' CMake makes it explicit where TPLs live... so you can just make sure you've got the correct compilers in path and configure with the copy/paste'd Cmake line. There's a way to have spack stop - and drop you into the build ENV, but that's way above my pay grade. Another way is to
It all feels like a mess to me. Wish spack supported R&D "using spack" (ya know, just drop the build Dir + cmake configure and exit) - I think it can do that, but it should be way easier and more obvious (Clearly!) |
@jjellio FWIW this is a way to drop into the environment without all that extra work that I curated. It is the last line in #10954 (comment) The source code really isn't bad. Just dump the env to file and then source it in a new shell. There is another native command |
@bartlettroscoe have you had a chance to look into this? It is a blocker to upgrading the version of Trilinos we use with Nalu-Wind. |
Thanks for reminder. Let me see if I can reproduce following commands above. |
I have been able to reproduce the configure failure shown above (see details section below) and I believe the solution is straightforward. The problem is the the TriBITS TPL glue module What is happening is the the old So the next step is to upgrade Using Spack-manager, it is possible to just rebuild Trilinos from the local Git repo in-place and then try building Nalu-wind again? That will allow me to confirm quickly that this fixes this Nalu-wind use case. Detailed notes on reproduction and analysis of the bug (click to expand)I am now going to try to reproduce the Nalu failure. It seems I need to reproduce this on an ascicgpu machine. The instructions are at: Doing:
Okay, so that reproduced the problem. Now to see if I can enter the dev env and do the configure manually:
Now let's try to configure manually:
Now I need to find out where the nalu-wind source tree is located? Greping the CMakeCache.txt file I see:
And in the file:
at lines 77-86, I see:
Okay, I think the problem here is the find_package(CUDAToolkit) command at lines 42-52:
So it must be that the
To test this, I will comment out lines CMakeLists.txt:42-52:
and configure again:
And bingo. The TriBITS-generated CUDAConfig.cmake file contains:
You can see the targets So I think the solution to this problem is to refactor tribits/cmake/core/std_tpls/FindTPLCUDA.cmake to use find_package(CUDAToolkit). It seems that FindCUDAToolkit.cmake was added in CMake 3.17. In fact, CMake 3.17 offically deprecates FindCUDA.cmake as per: So the solution seems straightforard. TriBITS and Trilinos should not be using a deprecated module. |
@bartlettroscoe this is good news. Yes if you make the changes in the trilinos source code that was cloned and then run spack env activate ${SPACK_MANAGER}/environments/bug # ensure the spack environment is active (assuming the commands above were executed)
spack cd -s trilinos # space will take you directly to the source code the environment is using
# make local code changes
spack install If you already have the environment active then you can skip step 1. |
@psakievich, I will have to do more testing on the Trilinos side, but I believe I have the fix. Is there a way to run tests for nalu-wind from within the Spack nalu-wind build dir? |
@bartlettroscoe yes go ahead and run these commands build-env-dive nalu-wind
ctest -R unit -VV If it builds and the unit tests run then that should be sufficient for this issue. |
@bartlettroscoe thanks for this. The segfault is a nalu-wind developer issue we'll have to track down. From the trilinos perspective it looks like the build issues are resolved. Thank you for getting this worked out. |
@psakievich, okay, I will clean this up, do some testing with Trilinos and post a Trilinos PR to resolve this. Thanks for posting issue and pointing out this problem. With the move to modern CMake and the generation of modern namespaced imported targets, there was bound to be a conflict at some point. |
NOTE: This has matching changes in the downstream Trilinos TPLs CUBLAS and CUSPARSE. You can't use this updated TriBITS version with Trilinos without those matching Trilinos FindTPL<tplName>.cmake changes.
NOTE: This requires the matching change to the TriBITS file tribits/core/std_tpls/FindTPLCUDA.cmake.
FYI: We have run into a problem when testing against the full Trilinos test suite. See #11084. I will post a PR and then someone will need to upgrade Stokhos's usage of CUSPARSE in order to allow that PR to merge and therefore things for Nalu-wind. Update: Looks like Stokhos is disabled in the rhel7 |
NOTE: This has matching changes in the downstream Trilinos TPLs CUBLAS and CUSPARSE. You can't use this updated TriBITS version with Trilinos without those matching Trilinos FindTPL<tplName>.cmake changes (see trilinos/Trilinos#10954).
This is needed in order to avoid namespace classes with downstream CMake projects that call find_package(CUDAToolkit) (e.g. like Nalu-Wind, see trilinos/Trilinos#10954). NOTE: This has matching changes in the downstream Trilinos TPL files FindTPLCUBLAS.cmake and FindTPLCUSPARSE.cmake. You can't use this updated TriBITS version with Trilinos without those matching Trilinos FindTPL<tplName>.cmake changes (see trilinos/Trilinos#10954).
Change to use find_package(CUDAToolkit) and some other changes (trilinos/Trilinos#10954)
Origin repo remote tracking branch: 'github/master' Origin repo remote repo URL: 'github = git@github.com:TriBITSPub/TriBITS.git' Git describe: Vera4.0-RC1-start-1292-g6d3bb5b3 At commit: commit 23dc20b901ab55943b71e51f5e64a244ad186b5a Author: Roscoe A. Bartlett <rabartl@sandia.gov> Date: Thu Sep 29 11:46:47 2022 -0600 Summary: Change to use find_package(CUDAToolkit) (#10954)
The fix is in PR #11093. I just needs to be approved and pass the PR builds and then it will be on the 'develop' branch. |
Thanks @bartlettroscoe ! |
…d-fix-cuda Automatically Merged using Trilinos Pull Request AutoTester PR Title: Change TriBITS/Trilinos TPLs to use find_package(CUDATookit) to fix builds with downstream customers using find_package(CUDATookit) (#10954) PR Author: bartlettroscoe
FYI: PR #11093 just merged to the 'develop' branch. Don't know when the 'master' branch will be updated but if you use the 'develop' branch you should be good to go. When it is confirmed that Nalu-Wind builds with the updated Trilinos version, can you put a comment in this issue so we can close this? Sorry for the delays in getting this triaged, fixed, and deployed. |
thanks @bartlettroscoe! Our nightly testing has lines that use |
@bartlettroscoe @tasmith4 as a friendly data point, I was able to build a cuda enabled nalu-wind with trilinos' develop branch this afternoon. This uses the same configuration as the nightly test. Thanks for all your work on this @bartlettroscoe. I'll let @tasmith4 give the final okay for closing this issue though. |
sweet! I doubt there will be an issue, but would still like to see it work on the nightlies before closing. |
…s:develop' (c748ef0). * trilinos-develop: (21 commits) Tpetra FECrs_MatrixMatrix: Adjust tolerance Tpetra: Removing floating point hard-equality check in test Sync latest Percept from Sierra Removes unneeded fences after Tpetra multiply in Belos and Thyra interfaces. (trilinos#10991) ZELLIJ: Add offset; fix parallel output Revert "Applications:" Revert "Update to work with current TriBITs where needed" Revert "Snapshot changes (partial)" Revert "IOSS Snapshot changes " Tpetra: Modifications to BlockCrsMatrix test to fix geminga nightlies Drivers: oops Drivers: Adding SYCL/CPU nightly to lightsaber Drivers: Adding SYCL/CPU nightly to lightsaber IOSS Snapshot changes Snapshot changes (partial) Update to work with current TriBITs where needed Applications: * Clean out extra blank lines * Add better url for showing users where documentation is located. Printed when -help entered. * clang-format differences Automatic snapshot commit from tribits at 23dc20b9 Intrepid2: changes to avoid compiler warnings Change to use find_package(CUDAToolkit) (trilinos#10954) ...
…s:develop' (c748ef0). * trilinos-develop: (21 commits) Tpetra FECrs_MatrixMatrix: Adjust tolerance Tpetra: Removing floating point hard-equality check in test Sync latest Percept from Sierra Removes unneeded fences after Tpetra multiply in Belos and Thyra interfaces. (trilinos#10991) ZELLIJ: Add offset; fix parallel output Revert "Applications:" Revert "Update to work with current TriBITs where needed" Revert "Snapshot changes (partial)" Revert "IOSS Snapshot changes " Tpetra: Modifications to BlockCrsMatrix test to fix geminga nightlies Drivers: oops Drivers: Adding SYCL/CPU nightly to lightsaber Drivers: Adding SYCL/CPU nightly to lightsaber IOSS Snapshot changes Snapshot changes (partial) Update to work with current TriBITs where needed Applications: * Clean out extra blank lines * Add better url for showing users where documentation is located. Printed when -help entered. * clang-format differences Automatic snapshot commit from tribits at 23dc20b9 Intrepid2: changes to avoid compiler warnings Change to use find_package(CUDAToolkit) (trilinos#10954) ...
looks good this morning -- thanks again @bartlettroscoe! |
Bug Report
@bartlettroscoe @psakievich
Description
When building Nalu-Wind with Cuda enabled against the latest Trilinos develop, we get Cuda-related configure errors such as
add_library cannot create imported target "CUDA::cufft" because another target with the same name already exists.
(see below for more extended output). We are reasonably certain these are caused by #10614 (unfortunately it took us a while to suspect this was the culprit, thus the delayed reporting of this issue).@bartlettroscoe I did not see these errors in the list in #10774, do you have ideas on what needs to change in Nalu-Wind's CMake system?
This can be reproduced with spack-manager (developer workflow quick start instructions here) on ascicgpu machines, using the spec
nalu-wind+cuda cuda_arch=70 ^trilinos@develop
, although @psakievich and I are happy to try stuff out ourselves as needed to resolve this issue.Extended output: (click to expand)
The text was updated successfully, but these errors were encountered: