New SEACAS tests failing in ATDM Trilinos builds starting on 7/19/2018 and 7/23/2018 #3183

Closed
bartlettroscoe opened this issue Jul 24, 2018 · 16 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: seacas type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jul 24, 2018

New SEACAS tests failing in ATDM Trilinos builds starting on 7/19/2018 and 7/23/2018

CC: @trilinos/seacas, @gsjaardema (pushed breaking commits?), @kddevin (Trilinos Data Services Product Lead)

Next Action Status

PR #3213, merged on 8/1/2018, disabled most of these tests in the 'mutrino' builds as of 8/2/2018. The disables were later fixed in PR #3251, merged on 8/8/2018. No test failures since 8/8/2018 as of 8/29/2018.

Description

As shown in this query for the builds today, the tests:

  • SEACASAprepro_aprepro_array_test
  • SEACASAprepro_aprepro_command_line_include_test
  • SEACASAprepro_aprepro_command_line_vars_test
  • SEACASAprepro_aprepro_unit_test
  • SEACASAprepro_lib_aprepro_lib_array_test
  • SEACASAprepro_lib_aprepro_lib_unit_test
  • SEACASExodus_exodus_unit_tests_nc5_env

are failing in the builds:

  • Trilinos-atdm-mutrino-intel-debug-openmp
  • Trilinos-atdm-mutrino-intel-opt-openmp

and the tests:

  • SEACASIoss_exodus32_to_exodus32
  • SEACASIoss_exodus32_to_exodus32_pnetcdf
  • SEACASIoss_exodus32_to_exodus64

are failing in the builds:

  • Trilinos-atdm-hansen-shiller-cuda-8.0-debug
  • Trilinos-atdm-hansen-shiller-cuda-8.0-opt

As shown in this query of failing SEACAS tests going back to 7/10/2018, the test SEACASExodus_exodus_unit_tests_nc5_env started failing on 7/19/2018 and the other tests started failing on 7/23/2018. Several PRs by @gsjaardema were merged in the days before these dates, so it is not clear which changes caused these new failures, but it seems likely that one or more of the commits in these merged PRs triggered them.

Also, the test SEACASAprepro_aprepro_test_dump_reread, added in one of these PRs, appeared on 7/23/2018 and then started randomly failing, as shown in this query. When the test passes, as shown here, it shows:

================================================================================

TEST_3

Running: "diff" "-w" "test-filter.dump" "test-reread.dump"

--------------------------------------------------------------------------------


--------------------------------------------------------------------------------

TEST_3: Return code = 0
TEST_3: Pass criteria = Zero return code [PASSED]
TEST_3: Result = PASSED

================================================================================

When it fails, as shown here, it shows:

================================================================================

TEST_3

Running: "diff" "-w" "test-filter.dump" "test-reread.dump"

--------------------------------------------------------------------------------

1,2c1,2
< Thu Aug  2 11:20:16 2018: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
< Thu Aug  2 11:20:16 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
---
> Thu Aug  2 11:20:17 2018: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> Thu Aug  2 11:20:17 2018: [unset]:_pmi_init:_pmi_alps_init returned -1

--------------------------------------------------------------------------------

TEST_3: Return code = 1
TEST_3: Pass criteria = Zero return code [FAILED]
TEST_3: Result = FAILED

================================================================================

Steps to reproduce

These failures should be reproducible on the machines 'hansen' or 'shiller' and 'mutrino' using the instructions in:

For example, for the failures on 'hansen'/'shiller', the specific instructions are given at:

After cloning Trilinos, the following commands should reproduce the test failures on 'hansen' or 'shiller':

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-8.0-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
  $TRILINOS_DIR

$ make NP=16

$ srun ctest -j16
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: seacas client: ATDM Any issue primarily impacting the ATDM project labels Jul 24, 2018
@gsjaardema
Contributor

@bartlettroscoe
The aprepro test failures on mutrino all seem to be due to the system writing some extra data to stderr. The test is diffing stderr with the gold stderr output. I guess I need a better way of getting just the aprepro stderr output... The extra data being written to stderr is:

0a1,2
> Tue Jul 24 08:34:53 2018: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> Tue Jul 24 08:34:53 2018: [unset]:_pmi_init:_pmi_alps_init returned -1

The new capability in aprepro is the addition of exodus query capability, so the extra data may be related to linking aprepro now to a parallel hdf5 and netcdf library...

I will take a look.
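One possible way to get "just the aprepro stderr output" is to strip the PMI/ALPS startup lines from both files before comparing them. The following is only a sketch under that assumption, not what was done for this issue (the tests were disabled instead). The function names `strip_pmi_noise` and `diff_ignoring_pmi_noise` are hypothetical, and the pattern is derived solely from the two stderr lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print a file with the Cray PMI/ALPS startup
# messages removed. The pattern matches only the two messages quoted
# in this issue ("]:_pmi_alps_init:" and "]:_pmi_init:").
strip_pmi_noise() {
  sed -E '/\]:_pmi_(alps_)?init:/d' "$1"
}

# Hypothetical helper: compare two files with "diff -w" after
# filtering the PMI/ALPS lines out of both of them.
# Returns the exit status of diff (0 = same, 1 = different).
diff_ignoring_pmi_noise() {
  tmp_a=$(mktemp) || return 2
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
  strip_pmi_noise "$1" > "$tmp_a"
  strip_pmi_noise "$2" > "$tmp_b"
  status=0
  diff -w "$tmp_a" "$tmp_b" || status=$?
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}
```

For example, `diff_ignoring_pmi_noise test-filter.dump test-reread.dump` would ignore lines that differ only in the timestamp of the PMI/ALPS messages. The trade-off is that any genuine aprepro output matching the same pattern would also be hidden.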

@gsjaardema
Contributor

@bartlettroscoe
The SEACASIoss tests on hansen should be fixed with the next test run since #3167 addresses exactly the problem causing these to fail.

@bartlettroscoe
Member Author

@gsjaardema, if you don't want to deal with the hassle of fixing those tests on 'mutrino', we can just disable them on 'mutrino'. I can provide detailed instructions on how to do that.

@gsjaardema
Contributor

@bartlettroscoe Yes, that would be good to do.

@gsjaardema
Contributor

@bartlettroscoe How do I disable the tests on mutrino?

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jul 31, 2018
Hopefully this documentation will allow any Trilinos developer to selectively
disable tests for the ATDM Trilinos builds.

Hopefully this documentation will allow a Trilinos developer to disable tests
as part of trilinos#3183 but this will be used in many future issues as well.
@bartlettroscoe
Member Author

@bartlettroscoe How do I disable the tests on mutrino?

@gsjaardema, if you have some time, can you please take a look at the documentation that explains how to do this that I wrote as part of the PR #3211? Specifically, can you read over the new section shown in this PR branch at:

and then comment in that PR if you see any problems or if anything is not clear? If you don't find anything wrong, can you please approve that PR so that we can merge it?

Then hopefully it will be straightforward how to disable these specific tests for the builds on 'mutrino' (which you can do in a new PR).

Note the sub-process "Temporarily disable the failing code or test".

This PR #3211 is just providing the technical details on how to disable the test.
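As a rough illustration of what disabling a test looks like at configure time, the sketch below turns a list of full test names into `-D<fullTestName>_DISABLE=ON` arguments. It assumes the TriBITS cache-variable convention `<fullTestName>_DISABLE`. The helper name `disable_flags` is hypothetical, and the documentation in PR #3211 remains the authoritative description of how the disables are set for the ATDM builds.

```shell
#!/bin/sh
# Hypothetical helper: print one "-D<fullTestName>_DISABLE=ON" argument
# per test name given on the command line.
# Assumption: the build honors the TriBITS <fullTestName>_DISABLE
# cache variable for each listed test.
disable_flags() {
  for test_name in "$@"; do
    printf '%s\n' "-D${test_name}_DISABLE=ON"
  done
}

# Example use (not run here), passing the flags to a configure like
# the one in "Steps to reproduce":
#   cmake $(disable_flags SEACASAprepro_aprepro_unit_test) ... $TRILINOS_DIR
```

Passing the flags on the command line only affects a local configure. Disabling a test in the automated builds still requires a change in the repository.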

bartlettroscoe added a commit that referenced this issue Aug 1, 2018
Hopefully this documentation will allow any Trilinos developer to selectively
disable tests for the ATDM Trilinos builds.

Hopefully this documentation will allow a Trilinos developer to disable tests
as part of #3183 but this will be used in many future issues as well.
@krcb
Contributor

krcb commented Aug 1, 2018 via email

bartlettroscoe pushed a commit that referenced this issue Aug 1, 2018
The executables are built in a parallel build, but they are
really serial so when run there is some extra info that is
output to stderr which ends up messing with the textual comparison
with the expected "gold" output files.  For now disable the tests
until can figure out a better way of running them.

This should address #3183.
@bartlettroscoe
Member Author

PR #3213 was just merged, which should disable these tests on the 'mutrino' builds. Once we get confirmation tomorrow that these tests are disabled, we can add the "Disabled Tests" label and move on.

@bartlettroscoe
Member Author

@gsjaardema,

The good news is that the 'mutrino' build SEACAS test results today shown here don't show the following tests as failing:

  • SEACASAprepro_aprepro_array_test
  • SEACASAprepro_aprepro_command_line_include_test
  • SEACASAprepro_aprepro_command_line_vars_test
  • SEACASAprepro_aprepro_unit_test
  • SEACASAprepro_lib_aprepro_lib_array_test
  • SEACASAprepro_lib_aprepro_lib_unit_test
  • SEACASExodus_exodus_unit_tests_nc5_env

(because they have been disabled)

The bad news is that it does show that the test SEACASAprepro_aprepro_test_dump_reread is failing in both of the builds. The test SEACASAprepro_aprepro_test_dump_reread appeared in testing on 7/23/2018 (and therefore must have been added by a PR merged to 'develop' on 7/22/2018) and then started randomly failing, as shown in this query.

When the test passes, as shown here, it shows:

================================================================================

TEST_3

Running: "diff" "-w" "test-filter.dump" "test-reread.dump"

--------------------------------------------------------------------------------


--------------------------------------------------------------------------------

TEST_3: Return code = 0
TEST_3: Pass criteria = Zero return code [PASSED]
TEST_3: Result = PASSED

================================================================================

When it fails, as shown here, it shows:

================================================================================

TEST_3

Running: "diff" "-w" "test-filter.dump" "test-reread.dump"

--------------------------------------------------------------------------------

1,2c1,2
< Thu Aug  2 11:20:16 2018: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
< Thu Aug  2 11:20:16 2018: [unset]:_pmi_init:_pmi_alps_init returned -1
---
> Thu Aug  2 11:20:17 2018: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> Thu Aug  2 11:20:17 2018: [unset]:_pmi_init:_pmi_alps_init returned -1

--------------------------------------------------------------------------------

TEST_3: Return code = 1
TEST_3: Pass criteria = Zero return code [FAILED]
TEST_3: Result = FAILED

================================================================================

This looks like the same problem you mentioned above with this annoying STDERR output on 'mutrino'. The reason we did not notice this before is that this test fails randomly, and the day that I looked at to create this issue did not happen to have it failing.

Can we disable this test as well in the 'mutrino' builds? And can we refactor all of the common test disables into a single *.cmake file as described here?

@gsjaardema
Contributor

gsjaardema commented Aug 2, 2018 via email

@bartlettroscoe
Member Author

@gsjaardema, I can create a new PR to disable this last test after the PR #3051 gets merged. Otherwise, these PRs could conflict (depending on how smart git/github is).

@bartlettroscoe
Member Author

These tests were shown failing again in the build Trilinos-atdm-mutrino-intel-opt-openmp-HSW yesterday, as shown here. The disables for these tests are not showing up in the configure output for this build shown here. I will figure out what happened and, while I am at it, I will disable the last test SEACASAprepro_aprepro_test_dump_reread.

@bartlettroscoe
Member Author

I can't explain what happened, but somehow the commit a133d86:

commit a133d8642643637115b20720926428c852402322
Author:     Joe Frye <jfrye@sandia.gov>
AuthorDate: Wed Jul 18 13:45:59 2018 -0600
Commit:     Roscoe A. Bartlett <rabartl@sandia.gov>
CommitDate: Thu Aug 2 09:27:48 2018 -0600

    Add KOKKOS_ARCH to ATDM_JOB_NAME_KEYS_STR and in all_supported_builds and tweaks files (#2680)
    
    This allows for different tweaks for multiple KOKKOS_ARCH values on the same
    system.
    
    Issue: #2680
    
    This changes the names of the existing builds on 'mutrino' to have an explicit
    HSW in the build name.

R100    cmake/std/atdm/mutrino/tweaks/INTEL-DEBUG-OPENMP.cmake  cmake/std/atdm/mutrino/tweaks/INTEL-DEBUG-OPENMP-HSW.cmake
R100    cmake/std/atdm/mutrino/tweaks/INTEL-RELEASE-OPENMP.cmake        cmake/std/atdm/mutrino/tweaks/INTEL-DEBUG-OPENMP-KNL.cmake
A       cmake/std/atdm/mutrino/tweaks/INTEL-RELEASE-OPENMP-HSW.cmake
A       cmake/std/atdm/mutrino/tweaks/INTEL-RELEASE-OPENMP-KNL.cmake

dropped the disables that @gsjaardema added in commit 5059d8a.

This was a simple file renaming so I can't understand how it deleted these lines. Very scary.

I will add the disables again.
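To catch this kind of silent loss earlier, one could check that every expected disable is still present in a given tweaks file. The sketch below is only an illustration: the helper name `missing_disables` is hypothetical, and it assumes each disable shows up in the file as the literal string `<fullTestName>_DISABLE`.

```shell
#!/bin/sh
# Hypothetical check: print every expected test name that has no
# "<fullTestName>_DISABLE" entry in the given tweaks file.
# Usage: missing_disables <tweaks-file> <test-name>...
# Prints nothing when all expected disables are present.
missing_disables() {
  tweaks_file=$1
  shift
  for test_name in "$@"; do
    grep -q -F -- "${test_name}_DISABLE" "$tweaks_file" \
      || printf '%s\n' "$test_name"
  done
}
```

Running this against cmake/std/atdm/mutrino/tweaks/INTEL-RELEASE-OPENMP-HSW.cmake with the test names listed in this issue would print the tests whose disables are missing from that file.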

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Aug 7, 2018
This also adds back the disables for several SEACAS tests that got removed
when the file INTEL-RELEASE-OPENMP.cmake got renamed to the file
INTEL-RELEASE-OPENMP-HSW.cmake (not clear how that happened).
@bartlettroscoe
Member Author

FYI: PR #3251 should contain the necessary disables and removed duplication.

@gsjaardema or @fryeguy52, can you please approve PR #3251 so that I can merge once PR testing is complete?

bartlettroscoe added a commit that referenced this issue Aug 8, 2018
This also adds back the disables for several SEACAS tests that got removed
when the file INTEL-RELEASE-OPENMP.cmake got renamed to the file
INTEL-RELEASE-OPENMP-HSW.cmake (not clear how that happened).
@bartlettroscoe
Member Author

FYI: As shown in this query, the test SEACASAprepro_aprepro_test_dump_reread has not failed in any promoted ATDM Trilinos build since 8/2/2018. Perhaps this is resolved?

@bartlettroscoe
Member Author

FYI: As shown in this query, the only ATDM group SEACAS tests failing since 8/8/2018 are the tests:

  • SEACASExodus_exodus_unit_tests
  • SEACASExodus_exodus_unit_tests_nc5_env

mostly on 'white' (except for one failure each on 'chama' and 'serrano' for some reason) and the failures on 'white' are addressed in #3288.

In any case, the test failures called out in this Issue seem to be fixed. Therefore, I think we can close this issue.

Closing as complete.

@bartlettroscoe bartlettroscoe added the PA: Data Services Issues that fall under the Trilinos Data Services Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…rilinos#3211)

Hopefully this documentation will allow any Trilinos developer to selectively
disable tests for the ATDM Trilinos builds.

Hopefully this documentation will allow a Trilinos developer to disable tests
as part of trilinos#3183 but this will be used in many future issues as well.
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
The executables are built in a parallel build, but they are
really serial so when run there is some extra info that is
output to stderr which ends up messing with the textual comparison
with the expected "gold" output files.  For now disable the tests
until can figure out a better way of running them.

This should address trilinos#3183.
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…linos#3251)

This also adds back the disables for several SEACAS tests that got removed
when the file INTEL-RELEASE-OPENMP.cmake got renamed to the file
INTEL-RELEASE-OPENMP-HSW.cmake (not clear how that happened).