Build from source on PowerPC 9: no cuda-version package #905

Closed
CharlelieLrt opened this issue Dec 5, 2023 · 11 comments

Comments

@CharlelieLrt

I am trying to build legate from source on Lassen (PowerPC9, OS: RHEL 7.9 Maipo) following instructions in the quickstart.

I generated a config file for my conda environment with ./scripts/generate-conda-envs.py --python 3.10 --ctk 11.8 --os linux. The resulting config file is:

name: legate-test
channels:
  - conda-forge
  - nvidia
dependencies:

  - python=3.10,!=3.9.7  # avoid https://bugs.python.org/issue45121

  # cuda
  - cuda-version=11.8
  - cutensor>=1.3.3,<2
  - nccl
  - pynvml

  # build
  - cmake>=3.24,!=3.25.0
  - cython
  - elfutils
  - git
  - make
  - ninja
  - numba
  - openssl
  - pkg-config
  - rust
  - scikit-build>=0.13.1
  - setuptools>=60
  - zlib

  # runtime
  - cffi
  - llvm-openmp
  - numpy>=1.22
  - libblas=*=*openblas*
  - openblas=*=*openmp*
  - openblas<=0.3.21
  - opt_einsum
  - scipy
  - typing_extensions

  # tests
  - clang-tools>=8
  - clang>=8
  - colorama
  - coverage
  - mock
  - mypy>=0.961
  - pre-commit
  - pytest-cov
  - pytest-lazy-fixture
  - pytest-mock
  - pytest
  - types-docutils
  - pynvml
  - tifffile

  # docs
  - pandoc
  - doxygen
  - ipython
  - jinja2
  - markdown<3.4.0
  - pydata-sphinx-theme>=0.13
  - myst-parser
  - nbsphinx
  - sphinx-copybutton
  - sphinx>=4.4.0

Trying to create the environment gives the following error:

ResolvePackageNotFound: 
  - pydata-sphinx-theme[version='>=0.13']
  - cuda-version=11.8

In addition, running conda search cuda-version -c nvidia -c conda-forge suggests that the cuda-version package does not exist in these channels.

@manopapad
Contributor

@CharlelieLrt can you share the full output from the command, and also conda --version? Can you also try with mamba?

The missing packages are not under linux-ppc64le, but they are under noarch, so that should be sufficient for conda to find and use them, even if you're on a PowerPC platform. @m3vaz any idea what might have happened here?
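A minimal sketch of the noarch point above: for each channel, conda consults both the platform subdirectory and noarch, so a package listed only under noarch can still be found on linux-ppc64le. The helper names are hypothetical, and the URL layout and the `packages` / `packages.conda` keys are assumed to follow the usual conda channel layout.

```python
from typing import Dict, List


def repodata_urls(channel: str, platform: str) -> List[str]:
    """Return the repodata URLs consulted for one channel (platform, then noarch)."""
    base = f"https://conda.anaconda.org/{channel}"
    return [f"{base}/{subdir}/repodata.json" for subdir in (platform, "noarch")]


def versions_in_repodata(repodata: Dict, name: str) -> List[str]:
    """List the versions of `name` found in one parsed repodata.json document."""
    found = set()
    # Older .tar.bz2 records and newer .conda records live under separate keys.
    for key in ("packages", "packages.conda"):
        for record in repodata.get(key, {}).values():
            if record.get("name") == name:
                found.add(record["version"])
    return sorted(found)
```

Downloading the two URLs returned by `repodata_urls("conda-forge", "linux-ppc64le")` and passing each parsed JSON document to `versions_in_repodata(..., "cuda-version")` would show which subdirectory actually carries the package.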

@CharlelieLrt
Author

CharlelieLrt commented Dec 5, 2023

Version is conda 4.6.14
I switched to mamba and it could find the cuda-version package. I could then install dependencies (except the ones for the docs, but I won't need them).

I am now trying to install legate with:

./install.py --cuda --arch volta --network gasnet1 --max-dim 5 --openmp --hdf5 --build-tests --build-examples --conduit ibv

but I get an error telling me that the version of cmake I am using is incompatible:

CMake Error at CMakeLists.txt:17 (cmake_minimum_required):
CMake 3.22.1 or higher is required.  You are running version 3.17.5

My PATH is:

/g/g92/laurent3/miniforge3/envs/legate_base/bin:...

So I checked the cmake I have there: /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake --version reports cmake version 3.27.9. In contrast, cmake3 --version reports cmake3 version 3.17.5, which is installed at /usr/bin/cmake3. I therefore assume that install.py is picking up this system-wide cmake3 instead of the cmake in my conda environment. I tried passing the extra argument --with-cmake /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake to install.py, but it did not change anything.

I believe this was mentioned in #837
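The PATH check described in this comment can be sketched as follows. This is only a diagnostic aid, assuming the build tooling looks up `cmake3` and `cmake` by name on PATH; the function names are hypothetical and nothing here reflects how install.py or scikit-build actually choose the executable.

```python
import shutil
import subprocess
from typing import Dict, Optional, Sequence


def resolve_executables(
    names: Sequence[str], path: Optional[str] = None
) -> Dict[str, Optional[str]]:
    """Map each executable name to its first match on PATH, or None if absent."""
    return {name: shutil.which(name, path=path) for name in names}


def first_version_line(exe: str) -> str:
    """Return the first line printed by `<exe> --version`."""
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


def report(names: Sequence[str] = ("cmake3", "cmake")) -> None:
    """Print where each candidate name resolves and which version it reports."""
    for name, exe in resolve_executables(names).items():
        print(name, "->", exe, "|", first_version_line(exe) if exe else "not found")
```

Calling `report()` inside the activated conda environment shows at a glance whether `cmake3` and `cmake` resolve to different installations, which is the situation described above.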

@manopapad
Contributor

I pushed a fix here, could you please try that? #908

@CharlelieLrt
Author

CharlelieLrt commented Dec 6, 2023

It did not solve the problem. Now I see:

[...]
conduit: ibv
gasnet_system: None
nccl_dir: None
cmake_exe: /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake
cmake_generator: Ninja
[...]

But later on:

  Configuring Project
    Working directory:
      /usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build
    Command:
      /usr/bin/cmake3 /usr/WS1/laurent3/Codes/LEGATE/legate.core -G Ninja [...]

So it's still trying to use the system's cmake3

Could it be because pip --global-option is deprecated? (pypa/pip#11859)
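A small sketch for spotting this mismatch automatically in a captured build log. It assumes the log layout quoted in this comment (a `cmake_exe: <path>` line, and a `Command:` line followed by the configure command on the next line); the function name is hypothetical.

```python
import re
from typing import Optional, Tuple


def configured_and_invoked_cmake(log: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract the cmake path reported in the settings and the one actually run."""
    # Path printed in the settings summary, e.g. "cmake_exe: /path/to/cmake".
    configured = re.search(r"^\s*cmake_exe:\s*(\S+)", log, flags=re.MULTILINE)
    # First token of the line following "Command:" in the configure step.
    invoked = re.search(r"^\s*Command:\s*\n\s*(\S+)", log, flags=re.MULTILINE)
    return (
        configured.group(1) if configured else None,
        invoked.group(1) if invoked else None,
    )
```

If the two returned paths differ, the setting was accepted but not honored by the configure step, which matches the behavior shown above.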

@CharlelieLrt
Author

CharlelieLrt commented Dec 7, 2023

As a temporary workaround I have defined a symlink for cmake3 to the right cmake.
I am now running into a cuda compilation error:

      Finished release [optimized] target(s) in 2m 44s
  [98/261] /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/cuda/stream_pool.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/cuda/stream_pool.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  /usr/include/sys/platform/ppc.h(31): error: identifier "__builtin_ppc_get_timebase" is undefined

I am loading cuda 12.0.0 with module load cuda/12.0.0 and my conda environment was generated with --ctk 12.0
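The cmake3 symlink workaround mentioned at the top of this comment could look roughly like the sketch below. The shim directory and function names are assumptions for illustration; the comment does not say where the symlink was created.

```python
import os
from pathlib import Path


def make_cmake3_shim(real_cmake: str, shim_dir: str) -> Path:
    """Create `<shim_dir>/cmake3` as a symlink to `real_cmake` and return it."""
    target = Path(real_cmake).resolve(strict=True)  # fail early if cmake is missing
    shim = Path(shim_dir)
    shim.mkdir(parents=True, exist_ok=True)
    link = shim / "cmake3"
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale shim
    link.symlink_to(target)
    return link


def prepend_to_path(directory: str, env: dict) -> dict:
    """Return a copy of `env` with `directory` placed first on PATH."""
    new_env = dict(env)
    joined = os.pathsep.join([directory, env.get("PATH", "")])
    new_env["PATH"] = joined.rstrip(os.pathsep)
    return new_env
```

The environment returned by `prepend_to_path` can be passed as `env=` to `subprocess.run` when launching the build, so that a lookup of `cmake3` by name finds the shim before /usr/bin/cmake3.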

@manopapad
Contributor

So it's still trying to use the system's cmake3
Could it be because pip --global-option is deprecated? (pypa/pip#11859)

I posted some follow-up comments on #908. This falls beyond my (very limited) knowledge around python packaging.

error: identifier "__builtin_ppc_get_timebase" is undefined

What host compiler are you using? If you try compiling an empty file with /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc --verbose empty.cu, you should be able to see what's getting called. E.g. on my local machine I see

#$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=3 -D__CUDACC_VER_BUILD__=103 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=3 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "a.cu" -o "/tmp/tmpxft_0023d560_00000000-5_a.cpp4.ii"

Are the pure-C++ files compiling correctly? What compiler are they using?
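For convenience, a sketch that pulls the invoked programs out of `nvcc --verbose` output. It assumes command lines carry the `#$ ` prefix seen in the excerpt above and that variable assignments such as `#$ PATH=...` should be skipped; the function name is hypothetical.

```python
from typing import List


def invoked_programs(nvcc_verbose_output: str) -> List[str]:
    """Return the first token of every `#$ ` command line, in order."""
    programs = []
    for line in nvcc_verbose_output.splitlines():
        if not line.startswith("#$ "):
            continue
        parts = line[3:].split(maxsplit=1)
        # Skip lines that only set a variable, e.g. "#$ PATH=/usr/bin:...".
        if parts and "=" not in parts[0]:
            programs.append(parts[0].strip('"'))
    return programs
```

The first entry is typically the host compiler used for preprocessing. When capturing the output with `subprocess.run`, it is safest to collect both stdout and stderr, since this sketch makes no assumption about which stream nvcc writes these lines to.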

@CharlelieLrt
Author

Trying to compile empty.cu, I get:

#$ gcc -D__NV_NO_HOST_COMPILER_CHECK=1 -std=c++14 -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/tce/packages/cuda/cuda-12.0.0/nvidia/bin/../targets/ppc64le-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=76 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=0 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" "empty.cu" -o "/var/tmp/laurent3/tmpxft_00003738_00000000-5_empty.cpp4.ii"

The pure C++ files seem to be compiled correctly. They use /usr/tce/packages/gcc/gcc-8.3.1/bin/c++ (c++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)).

@manopapad
Contributor

I reached out to compiler experts inside Nvidia and on the Legion Zulip for guidance. Unfortunately I don't have easy access to ppc64le machines to try and personally reproduce.

@CharlelieLrt
Author

So, given the comments on the Legion Zulip, I switched to a newer commit of Legion (d7121f886127e41773a283cbbaa51c452cd01054) that includes the fix for the __builtin_ppc_get_timebase error.

I now get a number of failed compilations, such as:

  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o
  /usr/tce/packages/gcc/gcc-8.3.1/bin/c++ -DLEGATE_USE_COLLECTIVE -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=gnu++17 -fPIC -mcpu=native -maltivec -mabi=altivec -mvsx -UTHRUST_DEVICE_SYSTEM -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o.d -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc: In member function 'void legate::VariantOptions::populate_registrar(Legion::TaskVariantRegistrar&)':
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc:56:13: error: 'struct Legion::TaskVariantRegistrar' has no member named 'set_concurrent'; did you mean 'add_constraint'?
     registrar.set_concurrent(concurrent);
               ^~~~~~~~~~~~~~
               add_constraint

Or:

  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o
  /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/comm/comm_nccl.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/data/store.h(174): error: namespace "Legion" has no member "OutputRegion"

  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/data/store.h(205): error: namespace "Legion" has no member "OutputRegion"

  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/utilities/deserializer.h(107): error: namespace "Legion" has no member "OutputRegion"

  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/ptr_traits.h(114): error: static assertion failed with "pointer type defines element_type or is like SomePointer<T, Args>"
            detected during:
              instantiation of class "std::pointer_traits<_Ptr> [with _Ptr=<error-type> *]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/alloc_traits.h(102): here
              instantiation of class "std::allocator_traits<_Alloc>::_Ptr<_Func, _Tp, <unnamed>> [with _Alloc=std::allocator<<error-type>>, _Func=std::__allocator_traits_base::__c_pointer, _Tp=const <error-type>, <unnamed>=void]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/alloc_traits.h(135): here
              instantiation of class "std::allocator_traits<_Alloc> [with _Alloc=std::allocator<<error-type>>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/ext/alloc_traits.h(52): here
              instantiation of class "__gnu_cxx::__alloc_traits<_Alloc, <unnamed>> [with _Alloc=std::allocator<<error-type>>, <unnamed>=<error-type>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h(84): here
              instantiation of class "std::_Vector_base<_Tp, _Alloc> [with _Tp=<error-type>, _Alloc=std::allocator<<error-type>>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h(339): here
              instantiation of class "std::vector<_Tp, _Alloc> [with _Tp=<error-type>, _Alloc=std::allocator<<error-type>>]"
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/utilities/deserializer.h(107): here

(and many others)
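Errors like these suggest that the Legion checkout lacks API that this legate.core revision expects. A rough way to confirm that before rebuilding is to scan the fetched Legion headers for the names from the error messages. The sketch below is a plain substring search with a hypothetical function name, so a hit in a comment would count as present.

```python
from pathlib import Path
from typing import Iterable, List


def missing_symbols(header_root: str, symbols: Iterable[str]) -> List[str]:
    """Return the symbols that appear in no .h/.inl file under `header_root`."""
    remaining = set(symbols)
    for header in Path(header_root).rglob("*"):
        if header.suffix not in {".h", ".inl"} or not header.is_file():
            continue
        text = header.read_text(errors="ignore")
        # Keep only the symbols this file does not mention.
        remaining = {s for s in remaining if s not in text}
        if not remaining:
            break
    return sorted(remaining)
```

Pointing `header_root` at the `_deps/legion-src/runtime` directory from the build log and passing `["set_concurrent", "OutputRegion"]` would list whichever of the two names the checked-out Legion commit does not provide.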

@manopapad
Contributor

Can you try with top-of-tree control_replication branch?

@CharlelieLrt
Author

Legion commit 04ee5be1dc3b742f195348c78458450f5dd35f44 worked, and there were no further problems compiling cunumeric, so everything is good (except for the few things already mentioned above).

Thanks for your help with this!
