failure in ud/uct_flush_test.am_zcopy_flush_ep_nb/1 #2047

Open
brminich opened this issue Dec 7, 2017 · 1 comment

Comments


brminich commented Dec 7, 2017

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5397/label=hpc-test-node,worker=2/consoleFull#98030106524dff065-0424-4e55-b698-eb134734d522

22:10:53 [ RUN      ] ud/uct_flush_test.am_zcopy_flush_ep_nb/1
22:10:53 [hpc-test-node:31879:0]    ud_verbs.c:305  Fatal: Send completion (wr_id=0xFAAFFAAF with error: local protection error 
22:10:54 
22:10:54 /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../src/uct/ib/ud/verbs/ud_verbs.c: [ uct_ud_verbs_iface_poll_tx() ]
22:10:54       ...
22:10:54       301 
22:10:54       302     if (ucs_unlikely(wc.status != IBV_WC_SUCCESS)) {
22:10:54       303         ucs_fatal("Send completion (wr_id=0x%0X with error: %s ",
22:10:54 ==>   304                   (unsigned)wc.wr_id, ibv_wc_status_str(wc.status));
22:10:54       305         return 0;
22:10:54       306     }
22:10:54       307 
22:10:54 
22:10:54 ==== backtrace ====
22:10:54  0 0x0000000000074d1a uct_ud_verbs_iface_poll_tx()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../src/uct/ib/ud/verbs/ud_verbs.c:304
22:10:54  1 0x00000000005273da uct_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../src/ucs/datastruct/callbackq.h:168
22:10:54  2 0x00000000004b2b68 uct_flush_test::flush_ep_nb()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/uct/test_flush.cc:273
22:10:54  3 0x00000000004b75a9 uct_flush_test::test_flush_am_zcopy()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/uct/test_flush.cc:193
22:10:54  4 0x00000000004b19e3 uct_flush_test_am_zcopy_flush_ep_nb_Test::test_body()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/uct/test_flush.cc:505
22:10:54  5 0x000000000046fd26 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/test.cc:249
22:10:54  6 0x0000000000467343 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:3562
22:10:54  7 0x000000000045b7bd testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:3635
22:10:54  8 0x000000000045b88c testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:3812
22:10:54  9 0x000000000045b9ef testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:3930
22:10:54 10 0x0000000000460387 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:5802
22:10:54 11 0x000000000046068b testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest-all.cc:5719
22:10:54 12 0x000000000040f193 main()  /scrap/jenkins/workspace/hpc-ucx-pr-3/label/hpc-test-node/worker/2/contrib/../test/gtest/common/gtest.h:20059
22:10:54 13 0x0000000000021c05 __libc_start_main()  ???:0
22:10:54 14 0x0000000000445f48 _start()  ???:0
22:10:54 ===================
22:10:54 Sending notification to mikhailb@mellanox.com

yosefe commented Jan 23, 2018

Happens because UD force-close does not clean up the QP, and doesn't handle send completions with error.
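
To illustrate the second half of that diagnosis, here is a rough sketch of how a TX completion handler could tell an error completion on a live endpoint apart from one left over after a force-close, instead of calling a fatal error unconditionally. This is not UCX code and not the eventual fix: every type and function name below (`sketch_ep_t`, `sketch_wc_status_t`, `sketch_handle_tx_completion`) is hypothetical, and the sketch assumes the completion can be mapped back to its endpoint, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; not UCX or libibverbs types. */
typedef enum {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_ERROR
} sketch_wc_status_t;

typedef struct {
    bool     force_closed;   /* set when the endpoint was force-closed */
    unsigned dropped_errors; /* error completions discarded after close */
} sketch_ep_t;

typedef enum {
    SKETCH_TX_OK,      /* normal completion: release the send resources */
    SKETCH_TX_DROPPED, /* error completion from a closed endpoint: ignore */
    SKETCH_TX_FATAL    /* error on a live endpoint: still a real failure */
} sketch_tx_action_t;

/* Decide what to do with one send completion for the given endpoint. */
static sketch_tx_action_t
sketch_handle_tx_completion(sketch_ep_t *ep, sketch_wc_status_t status)
{
    assert(ep != NULL);

    if (status == SKETCH_WC_SUCCESS) {
        return SKETCH_TX_OK;
    }

    if (ep->force_closed) {
        /* The endpoint is already gone, so count the failed send and
         * drop it rather than aborting the whole process. */
        ep->dropped_errors++;
        return SKETCH_TX_DROPPED;
    }

    return SKETCH_TX_FATAL;
}
```

The first half of the diagnosis (cleaning up the QP on force-close) is not covered by this sketch.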

@yosefe yosefe self-assigned this Feb 28, 2018