Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ShyLY: Tacho_TestSerial_double.cpp build error in intel-19.0.5 PR builds since 5/18/2022 #10549

Closed
bartlettroscoe opened this issue May 20, 2022 · 16 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Framework Issues that fall under the Trilinos Framework Product Area pkg: ShyLU type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Copy link
Member

Bug Report

@trilinos/framework, @trilinos/shylu

Description

As shown in this query showing:

image

the file:

packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp

is failing to build with the build error (for example, as shown here):

FAILED: packages/shylu/shylu_node/tacho/unit-test/CMakeFiles/Tacho_TestSerialDouble.dir/Tacho_TestSerial_double.cpp.o 
/projects/sems/install/rhel7-x86_64/sems/compiler/intel/19.0.5/mpich/3.2/bin/mpicxx  -I. -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/unit-test -Ipackages/shylu/shylu_node/tacho/unit-test -Ipackages/shylu/shylu_node/tacho/src -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/src -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/src/impl -Ipackages/kokkos/algorithms/src -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/kokkos/algorithms/src -Ipackages/kokkos/containers/src -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/kokkos/containers/src -Ipackages/kokkos/core/src -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/kokkos/core/src -Ipackages/kokkos -I/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/commonTools/gtest -fPIC -Wall -Warray-bounds -Wchar-subscripts -Wcomment -Wenum-compare -Wformat -Wuninitialized -Wmaybe-uninitialized -Wmain -Wnarrowing -Wnonnull -Wparentheses -Wpointer-sign -Wreorder -Wreturn-type -Wsign-compare -Wsequence-point -Wtrigraphs -Wunused-function -Wunused-but-set-variable -Wunused-variable -Wwrite-strings   -O3 -DNDEBUG -fPIE -std=c++14 -MD -MT packages/shylu/shylu_node/tacho/unit-test/CMakeFiles/Tacho_TestSerialDouble.dir/Tacho_TestSerial_double.cpp.o -MF packages/shylu/shylu_node/tacho/unit-test/CMakeFiles/Tacho_TestSerialDouble.dir/Tacho_TestSerial_double.cpp.o.d -o packages/shylu/shylu_node/tacho/unit-test/CMakeFiles/Tacho_TestSerialDouble.dir/Tacho_TestSerial_double.cpp.o -c /scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp
In file included from /scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/unit-test/Tacho_Test.hpp(15),
                 from /scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp(21):
/scratch/trilinos/jenkins/ascic142/workspace/trilinos-folder/intel-19.0.5/Trilinos/packages/shylu/shylu_node/tacho/unit-test/Tacho_TestDenseLinearAlgebra.hpp(113): error: more than one instance of overloaded function "abs" matches the argument list:
            function "abs(int)"
            function "std::abs(long long)"
            function "std::abs(long)"
            argument types are: (double)
        EXPECT_NEAR(abs(C1.h_view(i,j)), abs(C2.h_view(i,j)), eps);
        ^
...

Specifically, this is impacting two different PRs #10533 and #10537 since 5/18/2022 and three different PR iterations:

(NOTE: The build error for the build PR-10530-test-rhel7_sems-intel-19.0.5-mpich-3.2-serial_release-debug_static_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-297 for PR #10530 is not related and the last PR iteration for here passed.)

If you look at the changes in the PRs #10533 and #10537 don't impact the ShyLU_Node package at all so this error must be present in Trilinos 'develop'.

Steps to Reproduce

Run the intel-19.0.5 PR build that enables the ShyLU_Node package. (I don't know if you can reproduce this error locally, I have not tired.)

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: ShyLU PA: Framework Issues that fall under the Trilinos Framework Product Area labels May 20, 2022
@bartlettroscoe
Copy link
Member Author

@trilinos/framework, @trilinos/shylu

FYI: This is going to fail in all PR builds that happen to enable the ShyLU_Node packages and those PR will not be able to be merged until this error is fixed on the 'develop' branch.

@bartlettroscoe
Copy link
Member Author

FYI: It looks like this intel-19.0.5 build is also having trouble in the 'develop' to 'master' sync builds going back to 5/16/2022 as shown here:

showing:

image

Unfortunately, there is something wrong as you can't actually view the build errors. When you click on them, they come up empty.

@kyungjoo-kim
Copy link
Contributor

@ndellingwood I think that #10474 causes this failure.

@ndellingwood
Copy link
Contributor

@kyungjoo-kim thanks for adding me on the issue, I'll try and help.

The Tacho error in the post doesn't make sense to me, it only lists std::abs options for integral type input args but double is a valid input argument type (https://en.cppreference.com/w/cpp/numeric/math/fabs), why isn't the compiler even listing the correct overload option (and hence not error out)?
The error was also reported previously here: #10486 (comment)

I haven't run into this issue with serial builds with intel/19 on blake.
@bartlettroscoe Is this a problem with the intel/19 toolchain that was used in the testing where this error occurred? It doesn't make sense that the compiler would not pick up the option for std::abs that accepts a double as input argument. What version of gcc is loaded with the testing environment?

If this can't be resolved with the toolchain or environment, I can either revert #10474 and make similar changes tracked on the kokkos-promotion branch of Trilinos (otherwise we will have build failures with the kokkos develop branch) or I can try explicitly adding the Kokkos:: namespacing to the abs calls in the Tacho unit test where this occurs

@bartlettroscoe
Copy link
Member Author

I haven't run into this issue with serial builds with intel/19 on blake.
@bartlettroscoe Is this a problem with the intel/19 toolchain that was used in the testing where this error occurred? It doesn't make sense that the compiler would not pick up the option for std::abs that accepts a double as input argument. What version of gcc is loaded with the testing environment?

@ndellingwood, I can't comment on that as I did not have any part in setting up this intel-19.0.5 PR build. I would hope that someone could reproduce that intel-19.0.5 build locally so the problem can be fixed.

@ZUUL42, given that your PR #10537 is impacted by this intel-19.0.5 build error as well, do you know how to reproduce this build locally, like on a CEE RHEL7 machine?

@ndellingwood
Copy link
Contributor

I'm testing a couple builds on a machine with sems-archive-intel/19.0.5 and will see if I can reproduce.

An early observation worth sharing, module load sems-archive-intel/19.0.5 auto-loads sems-archive-gcc/6.1.0, which is a version of gcc that will not be supported much longer by kokkos and dropped once c++17 support is required.
If older versions of gcc are used in the newer PR builds that are being stood up it might be better to update to a newer version of gcc (with full c++17 support) sooner than later to avoid eventual issues

@ndellingwood
Copy link
Contributor

ndellingwood commented May 20, 2022

@bartlettroscoe @ZUUL42 @kyungjoo-kim I can confirm this is toolchain problem in part by use of sems-archive-gcc/6.1.0

I tested two builds: one loading module load sems-archive-intel/19.0.5 (and thus sems-archive-gcc/6.1.0) which reproduced the error, another build with a newer gcc via module load sems-archive-gcc/7.3.0 sems-archive-intel/19.0.5 which compiled without error (note module swap of the auto-loaded gcc is not allowed with the sems-archive compilers)

Here are more details of my builds (same cmake configuration for each):

Build 1: Compilation error reproduced

module purge
module load sems-archive-env 
module load sems-archive-intel/19.0.5 sems-archive-cmake/3.19.1

cmake \
 -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_C_COMPILER=`which icc` -DCMAKE_Fortran_COMPILER=`which ifort` \
 -DTPL_ENABLE_MPI=OFF -DTPL_ENABLE_BLAS:BOOL=ON -DTPL_ENABLE_LAPACK:BOOL=ON  -DTPL_BLAS_LIBRARIES="/usr/lib64/libblas.so.3" -DTPL_LAPACK_LIBRARIES="/usr/lib64/liblapack.so.3" \
 -DTrilinos_ENABLE_TESTS=OFF -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
 -DTrilinos_ENABLE_Kokkos=ON -DKokkos_ENABLE_SERIAL=ON -DTrilinos_ENABLE_ShyLU_NodeTacho=ON -DShyLU_NodeTacho_ENABLE_TESTS=ON \
$TRILINOS_DIR

Build 2: Newer gcc, no compilation error

module purge
module load sems-archive-env 
module load sems-archive-gcc/7.3.0 sems-archive-intel/19.0.5 sems-archive-cmake/3.19.1

cmake \
 -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_C_COMPILER=`which icc` -DCMAKE_Fortran_COMPILER=`which ifort` \
 -DTPL_ENABLE_MPI=OFF -DTPL_ENABLE_BLAS:BOOL=ON -DTPL_ENABLE_LAPACK:BOOL=ON  -DTPL_BLAS_LIBRARIES="/usr/lib64/libblas.so.3" -DTPL_LAPACK_LIBRARIES="/usr/lib64/liblapack.so.3" \
 -DTrilinos_ENABLE_TESTS=OFF -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
 -DTrilinos_ENABLE_Kokkos=ON -DKokkos_ENABLE_SERIAL=ON -DTrilinos_ENABLE_ShyLU_NodeTacho=ON -DShyLU_NodeTacho_ENABLE_TESTS=ON \
$TRILINOS_DIR

I don't think this needs to be resolved within Tacho

@ZUUL42
Copy link
Contributor

ZUUL42 commented May 20, 2022

The Intel 19 master merge build is still using the same version of the build it has used for quite some time, though it has been migrated to use GenConfig. It still uses the sems-archive TPLs.
We continued using Intel 17 for PR testing because we were never able to get the sems-archive TPLs configured to use Scotch and had to disable a number of things. See previous commit. While those lines aren't used anymore, they were migrated to GenConfig for the Intel 19 master merge build.

For PR testing, what we are now using is an Intel 19 build that successfully uses Scotch. It uses the sems-[non-archive] TPLs. The new spack-cm SEMS modules made it possible to use Scotch with Intel 19 again. Quite a bit of back and forth and tweaking was required.
So with this new Intel 19 PR build now fully functional, save current build error, we wanted to retire Intel 17 as a PR build and use this new Intel 19 build. However, with the troubles in the Master Merge build, we didn't want to change that one over just yet.

As for the current state of the PR Intel 19 build you are pointing out, it appears a build error that only happens in Intel 19 was introduced between my testing and releasing it. CDash The issue you pointed out where the job uses the source's develop branch, #10538, may have contributed to that. My develop branch hasn't been updated since Feb 25. So, the issue was introduced between then and now.

As for the Intel 19 Master Merge builds errors that started on Apr 29. Those must have been able to be introduced due to not having an Intel 19 PR build test in place at that time. Those errors won't be changed by anything we do with the new Intel 19 PR build.

I can swap back to the Intel 17 build for PR testing. However that will not fix the master merge errors. The Intel 19 MM build has been there for quite some time and has not changed. I'm reluctant to swap back as it will potentially allow other issues to pass through PR testing and hit the Intel 19 build in MM testing.
We need to get these build errors fixed if we are to have Intel 19 as part of either test build set, PR or MM.

@bartlettroscoe
Copy link
Member Author

The Intel 19 master merge build is still using the same version of the build it has used for quite some time, though it has been migrated to use GenConfig.

@ZUUL42, where are the instructions for reproducing this intel-19.0.5 build using GenConfig? The instructions at https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors don't seem to mention GenConfig and the script PullRequestIntel19.0.5TestingEnv.sh does not seem to be using GenConfig. I am confused how the PR testing works and how it loads the envs. Does it not use the scripts listed under cmake/std/sems/?

@bartlettroscoe
Copy link
Member Author

I can confirm this is toolchain problem with sems-archive-gcc/6.1.0

IMHO, the most logical next step is that this intel-19.0.5 build needs to be removed from blocking Trilinos PR builds immediately so that we can get PRs merging again. Then the issues with this intel-19.0.5 build can be resolved offline.

@bartlettroscoe
Copy link
Member Author

My develop branch hasn't been updated since Feb 25. So, the issue was introduced between then and now.

Yea, a lot can change in more than 2 months.

@ZUUL42, is there not a way to introduce new Trilinos PR builds through the Trilinos PR process? That is, you don't add a new Trilinos PR build unless the Trilinos PR testing process allows it. That would ensure that this type of situation does not happen in the future.

@ndellingwood
Copy link
Contributor

Can a new issue be opened for ongoing discussion of the intel-19.0.5 build (and its status as PR and/or master-merge acceptance test) for easier tracking and reference, with a link to this issue to track an item needing resolution?

Regarding the build failure reported with Tacho, the problem is not with source code in Tacho though the compiler error implies this: std::abs with input argument type of double is supported and should not be problematic. The problem is at the toolchain level and should be resolved with appropriate updates there. Cross-linking #10549 (comment)

@ZUUL42
Copy link
Contributor

ZUUL42 commented May 20, 2022

The MM build uses GenConfig. The PR build does not use GenConfig.
The PR build uses cmake/std/pr_config/pullrequest.ini
Neither use the .sh files.

An update to the .sh script was left out of the PR. We are moving away from those with GenConfig. Synchronizing all efforts has been troublesome.
I can file a PR to update that, but then it will only be current for the Intel 19 PR build and will not apply to the Intel 19 MM build.

As for introducing a new PR build within the PR system, that would require first adding it to the PR test set. That would prevent all other PRs from testing successfully and merging until the new build's PR is merged in. Typically that has not been necessary with the testing I do beforehand, but as you found I was working with a job that had a typo in it that I overlooked.

I can revert the PR test set to Intel 17.
However, that will not have any effect on the errors that have existed in Master Merge.

****** While working through this, I have discovered that sems-boost/1.74.0 appears to exist for a few gcc versions and clang/11, but not for sems-intel/19.0.5. So, it's not being loaded at all.
There is a sems-boost/1.70.0 I can move to for the Intel 19 PR build, but it seems best to keep it consistent across all builds.
@fryeguy52 could you look into making sure there is a sems-boost/1.74.0 built for sems-intel/19.0.5?
I'm not sure if that would explain any of these issues though.
And unfortunately this still doesn't affect the Intel 19 MM build.

Another thing I just found that may effect this. sems-mpich/3.2.1 is being loaded before sems-openmpi/4.0.5. openmpi is part of a default TPL set.
So openmpi is replacing mpich. I can reverse the order of those to make sure mpich replaces openmpi. It's going to take a while to run the tests to verify that the change helped.

@bartlettroscoe
Copy link
Member Author

Can a new issue be opened for ongoing discussion of the intel-19.0.5 build (and its status as PR and/or master-merge acceptance test) for easier tracking and reference, with a link to this issue to track an item needing resolution?

Makes sense to me.

@github-actions
Copy link

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label May 21, 2023
@github-actions
Copy link

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Framework Issues that fall under the Trilinos Framework Product Area pkg: ShyLU type: bug The primary issue is a bug in Trilinos code or tests
Projects
None yet
Development

No branches or pull requests

4 participants