
[CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) #18785

Merged
merged 10 commits into apache:v1.x from g3_to_g4
Aug 18, 2020

Conversation

ChaiBapchya
Contributor

Leverage G4 instances for unix-gpu instead of G3

  • update nvidiadocker command & remove cuda compat

  • replace cu101 with cuda since compat is no longer to be used

  • skip flaky tests

  • get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat

  • Revert "skip flaky tests"

This reverts commit 1c720fa.

  • revert removal of ubuntu_build_cuda

  • add linux gpu g4 node to all steps using g3 in unix-gpu pipeline

Refer: #18186

@mxnet-bot

Hey @ChaiBapchya, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, clang, edge, centos-gpu, unix-gpu, website, windows-cpu, miscellaneous, sanity, unix-cpu, windows-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@ChaiBapchya ChaiBapchya changed the title [CI][1.x] Upgrade unix gpu toolchain (#18186) [CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) Jul 24, 2020
@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]

@ChaiBapchya
Contributor Author

I ran this locally to try and reproduce the CI error, but it passes and doesn't throw the nvidia-docker error.

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu_cu101 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_cpugpu_perl

@leezu @josephevans any idea?

I can confirm it translates into the equivalent command:

docker \
        run \
        --gpus all \
        --cap-add \
        SYS_PTRACE \
        --rm \
        --shm-size=500m \
        -v \
        /home/ubuntu/chai-mxnet:/work/mxnet \
        -v \
        /home/ubuntu/chai-mxnet/build:/work/build \
        -v \
        /home/ubuntu/.ccache:/work/ccache \
        -u \
        1000:1000 \
        -e \
        CCACHE_MAXSIZE=500G \
        -e \
        CCACHE_TEMPDIR=/tmp/ccache \
        -e \
        CCACHE_DIR=/work/ccache \
        -e \
        CCACHE_LOGFILE=/tmp/ccache.log \
        -ti \
        mxnetci/build.ubuntu_gpu_cu101 \
        /work/runtime_functions.sh \
unittest_ubuntu_cpugpu_perl

@leezu
Contributor

leezu commented Jul 27, 2020

[2020-07-26T08:47:43.081Z] FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-docker': 'nvidia-docker'

That's related to the AMI. You could also update the build.py script to run docker run --gpus=all instead of nvidia-docker.
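
For illustration, a minimal Python sketch of that idea (a hypothetical helper, not the actual build.py code): construct the run command around the plain docker binary and add --gpus all only when GPU access is needed.

    from typing import List

    def docker_run_cmd(image: str, command: List[str],
                       use_gpus: bool = False, shm_size: str = "500m") -> List[str]:
        # Plain docker replaces nvidia-docker; --gpus requires Docker >= 19.03,
        # which the G4 AMI provides.
        cmd = ["docker", "run", "--rm", f"--shm-size={shm_size}"]
        if use_gpus:
            cmd += ["--gpus", "all"]
        return cmd + [image] + command

    # Example: the perl unittest stage discussed in this thread.
    print(" ".join(docker_run_cmd(
        "mxnetci/build.ubuntu_gpu_cu101",
        ["/work/runtime_functions.sh", "unittest_ubuntu_cpugpu_perl"],
        use_gpus=True)))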

@ChaiBapchya
Contributor Author

Yes, but I've updated Jenkinsfile_unix_gpu to use a G4 instance, which has the updated Docker version [our master pipeline uses G4 instances with the updated Docker version].
Moreover, build.py is updated in this PR too.
Hence, when I run it locally it properly translates to the docker run --gpus all command.
[Screenshot: Screen Shot 2020-07-27 at 10 15 56 AM]

@ChaiBapchya
Contributor Author

Never mind. @josephevans helped me identify that before calling run_container, build.py was first building the Docker container, and that build step was still using nvidia-docker via get_docker_binary, which needed to be removed as well. Dropped it.
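
In other words, the image build step also had to stop selecting nvidia-docker. A simplified sketch of the idea (not the actual build.py code; the Dockerfile path below is made up): the build only ever needs plain docker, since GPU access matters only at run time.

    def build_image_cmd(tag: str, dockerfile: str, context: str = ".") -> list:
        # Building an image never needs GPU access, so plain `docker build`
        # is sufficient; only `docker run` ever needs --gpus.
        return ["docker", "build", "-f", dockerfile, "-t", tag, context]

    # Hypothetical Dockerfile path, for illustration only.
    print(" ".join(build_image_cmd("mxnetci/build.ubuntu_gpu_cu101",
                                   "path/to/Dockerfile.build.ubuntu_gpu_cu101")))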

@szha
Member

szha commented Jul 30, 2020

I think when enabling the branch protection, we accidentally turned on "Require branches to be up to date before merging". I'm requesting to disable it in https://issues.apache.org/jira/browse/INFRA-20616. Don't worry about updating the branch in this PR for now.

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]
Now that the Apache Infra team has resolved https://issues.apache.org/jira/browse/INFRA-20616.

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@sandeep-krishnamurthy
Contributor

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

ChaiBapchya and others added 3 commits August 14, 2020 18:28
* Remove mention of nightly in pypi (apache#18635)

* update bert dev.tsv link

Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
@ChaiBapchya
Contributor Author

ubuntu_gpu_cu101 on the 1.x branch relies on libcuda compat. However, with the upgrade from G3 to G4 instances we no longer rely on libcuda compat; using it gives a CUDA driver/display driver error.

After removing the LD_LIBRARY_PATH kludge for libcuda compat, 4 builds in the unix-gpu pipeline failed because TVM=ON relies on libcuda compat.
PR #18204 disabled TVM on the master branch due to a known issue, so I'm doing the same for the v1.x branch.

Note: I haven't cherry-picked that PR because the master branch CI differs from v1.x [e.g. most unix-gpu builds on master use cmake instead of make].

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [unix-gpu]
Re-triggering for a flaky issue.

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@ChaiBapchya
Contributor Author

@jinboci I saw one of your PRs for fixing TVM Op errors. Any idea why this test fails when using TVM=ON?
It's failing for 3 builds: Python3 GPU, Python3 MKLDNN GPU, Python3 MKLDNN-NoCUDNN GPU.

Common Stack Trace

test_operator_gpu.test_kernel_error_checking ... terminate called after throwing an instance of 'dmlc::Error'

[2020-08-17T05:59:15.843Z]   what():  [05:59:13] /work/mxnet/3rdparty/tvm/src/runtime/workspace_pool.cc:115: Check failed: allocated_.size() == 1 (3 vs. 1) : 

In the CI Jenkins_steps.groovy, for Python3 GPU we're packing

compile_unix_full_gpu()
utils.pack_lib('gpu', mx_lib_cpp_examples)

where

mx_lib_cpp_examples = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, 3rdparty/ps-lite/build/libps.a, deps/lib/libprotobuf-lite.a, deps/lib/libzmq.a, build/cpp-package/example/*, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'

While unpacking

test_unix_python3_gpu()
utils.unpack_and_init('gpu', mx_lib_cython)

where mx_lib_cython is a subset of mx_lib_cpp_examples

mx_lib_cython = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, lib/tvmop.conf, build/libcustomop_lib.so, build/libcustomop_gpu_lib.so, build/libsubgraph_lib.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, python/mxnet/_cy3/*.so, python/mxnet/_ffi/_cy3/*.so'
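
(A quick sanity check of that subset claim in plain Python; it assumes the two assignments above, which happen to be valid Python as written, are pasted in first:)

    packed = {p.strip() for p in mx_lib_cpp_examples.split(',')}
    unpacked = {p.strip() for p in mx_lib_cython.split(',')}
    assert unpacked <= packed          # every artifact the test stage unpacks was packed
    print(sorted(packed - unpacked))   # artifacts packed but unused by the cython test stage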

Based on the stack trace, the TVM runtime check on allocated_.size() in workspace_pool.cc is failing.
@DickJC123 I see you submitted this test. Any idea why this is troubling TVM?

@leezu
Contributor

leezu commented Aug 17, 2020

@ChaiBapchya on master, -DUSE_TVM_OP=ON is disabled for all GPU builds due to known issues. You can disable it on the 1.x branch as well.

@jinboci
Contributor

jinboci commented Aug 18, 2020

@ChaiBapchya It seems the unix-gpu tests have passed. Most of my work on TVMOp is written up in issue #18716. However, I don't think we were encountering the same problem.

@ChaiBapchya
Contributor Author

Yes, I've dropped TVMOp support from the unix-gpu pipeline, and that allowed the pipeline to pass.

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@ChaiBapchya
Contributor Author

@mxnet-bot run ci [windows-gpu]
Re-triggering as windows-gpu timed out.

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@szha szha merged commit 9981e84 into apache:v1.x Aug 18, 2020
@ChaiBapchya ChaiBapchya deleted the g3_to_g4 branch September 9, 2020 05:29
Labels: CI, pr-awaiting-review (PR is waiting for code review)

8 participants