Merging batch-dpcpp into batch-develop #1298

Closed: wants to merge 140 commits into batch-develop from batch-dpcpp.

Changes from all commits (140 commits):
653a167
Log dpcpp version info
pratikvn Feb 6, 2023
9e78c4a
Add a batch_struct for dpcpp
pratikvn Feb 6, 2023
68e512d
update headers in dpcpp kernels
pratikvn Feb 6, 2023
a1ce184
implemented spmv and advanced_spmv
phu0ngng Jan 24, 2023
41d649c
spmv compiled, but not batch_scale
phu0ngng Feb 8, 2023
e096ed0
Instantiate for int32 only
pratikvn Feb 9, 2023
916545b
batch_csr compiled but did not pass tests
phu0ngng Feb 9, 2023
2221d99
DiagScale still didn't pass the tests
phu0ngng Feb 10, 2023
3d791e4
add batch_csr kernels, passed the tests
phu0ngng Feb 10, 2023
23e1a37
saved current state
phu0ngng Feb 13, 2023
bb09fff
implemented batch_dense_kernels, all tests passed
phu0ngng Feb 14, 2023
b750f33
implemented batch diagonal kernels
phu0ngng Feb 14, 2023
cec05a8
implemented batch ell kernels
phu0ngng Feb 14, 2023
d14dc0a
Move batch matrix kernels to header files
pratikvn Feb 16, 2023
58f39e0
Add batch preconditioner headers
pratikvn Feb 16, 2023
5a81d3f
Enable batch dpcpp solver dispatch
pratikvn Feb 16, 2023
bda56f6
Add dpcpp batch stop criteria
pratikvn Feb 16, 2023
63158c0
Add dpcpp batch loggers
pratikvn Feb 16, 2023
53541c8
Add template batch_cg kernel header
pratikvn Feb 16, 2023
a6e1673
moved item_ct1 as the last argument of device kernels, rm unused head…
phu0ngng Feb 16, 2023
47f1c89
renamed some device functions
phu0ngng Feb 22, 2023
83e9682
implemented basic preconditioners for batch, did not test them yet
phu0ngng Feb 22, 2023
b6f82a1
implemented batch-cg, did not test them yet
phu0ngng Feb 22, 2023
6c52440
tmp save
phu0ngng Feb 22, 2023
69c9c0e
pushed batch_vector_kernels.hpp
phu0ngng Feb 22, 2023
5b4022b
Move .hpp files to .hpp.inc and remove namespaces
pratikvn Feb 23, 2023
e06776a
Use .hpp.inc in correct namespaces
pratikvn Feb 23, 2023
ca7858d
Fix function signatures and remove exceptions
pratikvn Feb 23, 2023
2102c94
Enable batch_cg dpcpp tests
pratikvn Feb 23, 2023
5c321cb
updated header files
phu0ngng Feb 23, 2023
93135e5
added a new line after licence text
phu0ngng Feb 23, 2023
534b3e4
cg passed 4/7 tests
phu0ngng Feb 27, 2023
23002f6
- `batch_cg` passed 4/7 tests.
phu0ngng Feb 28, 2023
ecba3df
rm old files
phu0ngng Feb 28, 2023
7670da1
batch_cg passed tests
phu0ngng Mar 8, 2023
3d2ca3f
minor changes; these changes need to be redone in the future
phu0ngng Mar 8, 2023
7fe1198
saved current state
phu0ngng Mar 9, 2023
1ec5ba0
batch_cg passed all tests when using slm
phu0ngng Mar 9, 2023
6337228
passing the accessors with their ptrs
phu0ngng Mar 9, 2023
056f002
slm calculation does not involve a matrix_size subtraction
phu0ngng Mar 9, 2023
a164182
Revert "passing the accessors with their ptrs"
phu0ngng Mar 9, 2023
de45f31
Revert "slm calculation does not involve a matrix_size substraction"
phu0ngng Mar 10, 2023
4e35f57
passing accessors via their pointers
phu0ngng Mar 10, 2023
ab0c84c
enforcing subgroup_size for all block_jacobi routines
phu0ngng Mar 13, 2023
50498f3
fixed
phu0ngng Mar 13, 2023
656ba38
fixed
phu0ngng Mar 13, 2023
e8da1d5
- Batch Block Jacobi does not work yet. The issue might come from the
phu0ngng Mar 13, 2023
1488e5c
minor changes for csr single matvec
phu0ngng Mar 13, 2023
415cd02
marked BatchBlockJacobi as NOT_IMPLEMENTED
phu0ngng Mar 13, 2023
63b9280
implemented batch_bicgstab, passed all tests
phu0ngng Mar 13, 2023
15ea037
corrected include path for .hpp.in with cmake preprocessor
phu0ngng Mar 14, 2023
d2888af
inline all 'device' functions
phu0ngng Mar 14, 2023
443d268
Format files
ginkgo-bot Mar 14, 2023
d2ac213
implemented batch_richardson, passed all tests
phu0ngng Mar 15, 2023
489c180
enabled test compilations for dpcpp
phu0ngng Mar 15, 2023
4e136aa
- implemented batch_lower_trs_kernels. For now, the kernel calls the
phu0ngng Mar 15, 2023
4b97df1
implemented batch_upper_trs
phu0ngng Mar 15, 2023
c923247
enabling batch_idr test
phu0ngng Mar 15, 2023
b78895b
wrote onemkl bindings for getrf and getrs
phu0ngng Mar 17, 2023
7f1b5c1
batch_direct is implemented but not compilable yet
phu0ngng Mar 17, 2023
b00cd94
- wrote getrs_batch and getrf_batch in onemkl_bindings.cpp.
phu0ngng Mar 18, 2023
42fc377
tmp saved
phu0ngng Mar 21, 2023
0f17ec5
fixed batch_cg and batch_bicgstab, both kernels should work with larg…
phu0ngng Mar 21, 2023
310dcdf
Add batch_direct tests for dpcpp
pratikvn Mar 24, 2023
df67ec5
Enable dpcpp batch_richardson tests
pratikvn Mar 24, 2023
da5e544
batch_block_jacobi passed test for float and double, but not for cmplx
phu0ngng Mar 29, 2023
5e35df2
added tests for component functions of batch_block_jacobi
phu0ngng Mar 30, 2023
2b83e07
Some fixes for nan output with complex on dpcpp
pratikvn Mar 31, 2023
57756bb
fixed batch_direct, passed all tests now
phu0ngng Apr 3, 2023
afa88d5
implemented batch_gmres, passed 6/7 tests on iris
phu0ngng Apr 3, 2023
ee1e042
fixed slm allocation issue
phu0ngng Apr 4, 2023
225061e
implemented batch_idr, passed 5/6 tests
phu0ngng Apr 4, 2023
4f4e3da
fixed slm allocation in batch_block_jacobi
phu0ngng Apr 5, 2023
400560c
impl batch_ilu, batch_isai, batch_ilu_isai, compilation does not go t…
phu0ngng Apr 11, 2023
902a1fd
added .inc files
phu0ngng Apr 11, 2023
7072b8a
batch_ilu passed all tests. batch_isai and batch_ilu_isai only passed…
phu0ngng Apr 11, 2023
ede82da
batch_isai and batch_ilu_isai passed all tests
phu0ngng Apr 12, 2023
1be2895
Add a run script
pratikvn Apr 12, 2023
392db00
improved performance of batch_cg by ~3x for matsize bigger than 1024
phu0ngng Apr 28, 2023
2584e2e
Remove branching to improve perf for small mats
pratikvn Apr 28, 2023
f694ec4
changes in example: nreps + time in msecs
phu0ngng Apr 28, 2023
2625440
implemented dynamic group_size + clean up
phu0ngng Apr 28, 2023
875d45d
implemented branching for small and big matsize with different group …
phu0ngng May 1, 2023
1cec8c9
fixed sync
phu0ngng May 1, 2023
a913280
implemented kernel for each mat_size range
phu0ngng May 1, 2023
fc65a53
minor changes in order of calculations
phu0ngng May 1, 2023
5be03f0
optimization for batch_cg medium kernel
phu0ngng May 4, 2023
3d79d9e
resolved conflicts
phu0ngng May 31, 2023
06e5fa4
allocated alpha once outside of iteration steps
phu0ngng May 31, 2023
3c509b4
templated vecs_shared_all and sg_kernel_only, using lambda to contain…
phu0ngng Jun 6, 2023
e2c36a9
clean batch_vector_kernels
phu0ngng Jun 6, 2023
bfdcf93
templated batch_cg for all optimization parameters
phu0ngng Jun 6, 2023
31145ea
added icpx for sycl compilation into CMakeLists.txt
phu0ngng Jun 7, 2023
fd32fd4
renamed template parameters, add group_size calculations
phu0ngng Jun 8, 2023
d1cb464
implemented optimizations for batch_gmres and batch_cg, passed all tests
phu0ngng Jun 8, 2023
51eec85
cleanup
phu0ngng Jun 8, 2023
2f047f3
rm a few barriers
phu0ngng Jun 8, 2023
3d590b0
fixed batch_gmres for checking rot and hess storages
phu0ngng Jun 20, 2023
b00d7c9
resolved conflicts
phu0ngng Jun 20, 2023
e0d9281
- Splitting rot storage into individual vectors. Now n_global is 9 (5
phu0ngng Jun 22, 2023
f5dd9ca
adjusted init workgroup size + size range for selecting the kernels
phu0ngng Jun 22, 2023
3d0c0c9
fixed slm overuse issue
phu0ngng Jun 29, 2023
9012f49
- Templated batch_bicgstab and batch_cg.
phu0ngng Jun 30, 2023
ee94478
added batch_ell to example
phu0ngng Jun 30, 2023
1ed08df
templated batch_gmres cuda
Jun 30, 2023
70559bb
added c++17 std
Jul 3, 2023
778a1e4
added cuda templating
Jul 11, 2023
0d86327
rm timer in CMakeLists
phu0ngng Jul 17, 2023
92b0866
fixed CMakeLists so that it includes batched kernels
phu0ngng Jul 18, 2023
f56bcac
undo CMakeLists
phu0ngng Jul 18, 2023
c4b027b
fixed rot vectors' and hess vector's sizes
phu0ngng Jul 18, 2023
bb67f02
make sure num_rot_vecs_shared does not have a negative value
phu0ngng Jul 18, 2023
7e3c151
rm prints in cuda batch gmres
phu0ngng Jul 18, 2023
bddf39b
fixed typos
phu0ngng Jul 18, 2023
2aa47b8
added case 0 for batch_bicgstab template
phu0ngng Jul 25, 2023
d7cbd0c
added case 0 for batch cg
phu0ngng Jul 26, 2023
6a00f2e
rm n_shared templating for batch_gmres
phu0ngng Jul 26, 2023
6e03daf
rm time print
phu0ngng Jul 26, 2023
ca819d3
temp: commented out templating in batch gmres
Jul 27, 2023
d29044e
added large test for batch cg
Jul 27, 2023
cdcd35c
cuda: added n_shared template for batch_cg and batch_bicgstab
Jul 27, 2023
f71c98e
fixed raw_free mem issue
Jul 27, 2023
0affb43
cuda: cleaning
Jul 27, 2023
0abd4e1
add pele/sundials matrices
pratikvn Aug 3, 2023
362445b
enabled tests
phu0ngng Aug 3, 2023
f150acd
changed nreps in batched-solver.cpp
phu0ngng Aug 3, 2023
26ab22a
added check prec_storage != 0
phu0ngng Aug 3, 2023
103143e
changed batched-solver-from-files to print only time in msec
phu0ngng Aug 3, 2023
1602c9d
optimized work-group selection
phu0ngng Aug 10, 2023
6d71332
added test scripts
phu0ngng Aug 14, 2023
292bc07
renamed get_group_size func, and rewrote 1/0 into true/false
phu0ngng Aug 21, 2023
94623a6
formatted
phu0ngng Aug 21, 2023
b39b740
added gmres with n_shared templating
Aug 22, 2023
487d7b1
changed alignment to 2
Aug 22, 2023
1f81ab6
added a large test case for batch_gmres
Aug 22, 2023
a5654c9
Format files
ginkgo-bot Aug 22, 2023
8e01e7f
rebase fixes
pratikvn Oct 15, 2023
2396572
remove sundials mats
pratikvn Oct 15, 2023
ebee316
prefix_sum rename
pratikvn Oct 15, 2023
e3f2bc7
stream and dpcpp fixes
pratikvn Oct 16, 2023
CMakeLists.txt: 28 changes (17 additions & 11 deletions)

@@ -123,11 +123,9 @@ endif()

 # For now, PGI/NVHPC nvc++ compiler doesn't seem to support
 # `#pragma omp declare reduction`
-#
-# The math with optimization level -O2 doesn't follow IEEE standard, so we
-# enable that back as well.
-if (CMAKE_CXX_COMPILER_ID MATCHES "PGI|NVHPC")
-    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Kieee")
+if (${CMAKE_CXX_COMPILER_ID} MATCHES "PGI|NVHPC")
+    message(STATUS "OpenMP: Switching to OFF because PGI/NVHPC nvc++ compiler lacks important features.")
+    set(GINKGO_BUILD_OMP OFF)
 endif()

 #Batch Ginkgo options:
@@ -306,8 +304,16 @@ if(MSVC)
 endif()

 if(GINKGO_BUILD_DPCPP)
-    ginkgo_extract_dpcpp_version(${CMAKE_CXX_COMPILER} GINKGO_DPCPP_MAJOR_VERSION __LIBSYCL_MAJOR_VERSION)
-    ginkgo_extract_dpcpp_version(${CMAKE_CXX_COMPILER} GINKGO_DPCPP_VERSION __SYCL_COMPILER_VERSION)
+    if(CMAKE_CXX_COMPILER MATCHES "icpx")
+        set(GINKGO_DPCPP_MAJOR_VERSION "6")
+        set(GINKGO_DPCPP_VERSION "6")
+        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl")
+        # AoT
+        # set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fsycl -fsycl-targets=spir64 -Xsycl-target-backend '-device 12.60.7' -fsycl-max-parallel-link-jobs=8 -fsycl-unnamed-lambda")
+    else()
+        ginkgo_extract_dpcpp_version(${CMAKE_CXX_COMPILER} GINKGO_DPCPP_MAJOR_VERSION __LIBSYCL_MAJOR_VERSION)
+        ginkgo_extract_dpcpp_version(${CMAKE_CXX_COMPILER} GINKGO_DPCPP_VERSION __SYCL_COMPILER_VERSION)
+    endif()
 else()
     set(GINKGO_DPCPP_MAJOR_VERSION "0")
 endif()
@@ -453,10 +459,10 @@ if(NOT "${CMAKE_GENERATOR_TOOLSET}" STREQUAL "")
 endif()
 add_custom_target(test_install
     COMMAND ${CMAKE_COMMAND} -G${CMAKE_GENERATOR} ${TOOLSET}
-        -S${GINKGO_TEST_INSTALL_SRC_DIR}
+        -H${GINKGO_TEST_INSTALL_SRC_DIR}
         -B${GINKGO_TEST_INSTALL_BIN_DIR}
         -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
-        -DCMAKE_PREFIX_PATH=${CMAKE_INSTALL_PREFIX}
+        -DCMAKE_PREFIX_PATH=${CMAKE_INSTALL_PREFIX}/${GINKGO_INSTALL_CONFIG_DIR}
         -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
         -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
         -DCMAKE_CUDA_COMPILER=${CMAKE_CUDA_COMPILER}
@@ -476,7 +482,7 @@ add_custom_target(test_install

 add_custom_target(test_exportbuild
     COMMAND ${CMAKE_COMMAND} -G${CMAKE_GENERATOR} ${TOOLSET}
-        -S${GINKGO_TEST_EXPORTBUILD_SRC_DIR}
+        -H${GINKGO_TEST_EXPORTBUILD_SRC_DIR}
         -B${GINKGO_TEST_EXPORTBUILD_BIN_DIR}
         -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
         -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
@@ -494,7 +500,7 @@ add_custom_target(test_exportbuild

 add_custom_target(test_pkgconfig
     COMMAND ${CMAKE_COMMAND} -G${CMAKE_GENERATOR} ${TOOLSET}
-        -S${GINKGO_TEST_PKGCONFIG_SRC_DIR}
+        -H${GINKGO_TEST_PKGCONFIG_SRC_DIR}
         -B${GINKGO_TEST_PKGCONFIG_BIN_DIR}
         -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
         -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
common/cuda_hip/matrix/batch_csr_kernels.hpp.inc: 19 changes (9 additions & 10 deletions)

@@ -296,12 +296,12 @@ __device__ __forceinline__ void csr_advanced_matvec_kernel(

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size, sm_multiplier) void advanced_spmv(
-    const batch_dense::UniformBatch<const ValueType> alpha,
-    const gko::batch_csr::UniformBatch<const ValueType> a,
-    const batch_dense::UniformBatch<const ValueType> b,
-    const batch_dense::UniformBatch<const ValueType> beta,
-    const batch_dense::UniformBatch<ValueType> c)
+__launch_bounds__(default_block_size, sm_multiplier) void advanced_spmv(
+    const batch_dense::UniformBatch<const ValueType> alpha,
+    const gko::batch_csr::UniformBatch<const ValueType> a,
+    const batch_dense::UniformBatch<const ValueType> b,
+    const batch_dense::UniformBatch<const ValueType> beta,
+    const batch_dense::UniformBatch<ValueType> c)
 {
     for (size_type ibatch = blockIdx.x; ibatch < a.num_batch;
          ibatch += gridDim.x) {
@@ -466,10 +466,9 @@ __global__ void uniform_convert_to_batch_dense(


 __global__
-__launch_bounds__(default_block_size) void check_all_diagonal_locations(
-    const int min_rows_cols, const int* const __restrict__ row_ptrs,
-    const int* const __restrict__ col_idxs,
-    bool* const __restrict__ all_diags)
+__launch_bounds__(default_block_size) void check_all_diagonal_locations(
+    const int min_rows_cols, const int* const __restrict__ row_ptrs,
+    const int* const __restrict__ col_idxs, bool* const __restrict__ all_diags)
 {
     constexpr auto warp_size = config::warp_size;
     const auto tile =
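
All of the batched kernels in these files share one dispatch pattern: each thread block walks the batch entries in a grid-stride loop and applies a single-entry device kernel to its slice of the batched data, relying on the entries being mutually independent. The following minimal CUDA sketch illustrates that pattern; batch_csr_view and single_spmv are simplified stand-ins for Ginkgo's UniformBatch types and device kernels (double values, int indices, one sparsity pattern shared by the whole batch), not the library's actual API.

// Simplified stand-in for gko::batch_csr::UniformBatch: all entries share
// one sparsity pattern, only the values differ per entry.
struct batch_csr_view {
    int num_batch;
    int num_rows;
    int num_nnz;
    const int* row_ptrs;   // shared across the batch
    const int* col_idxs;   // shared across the batch
    const double* values;  // num_batch * num_nnz values, entry-contiguous
};

// One thread per row of a single entry: y = A_b * x_b.
__device__ void single_spmv(const int num_rows, const int* row_ptrs,
                            const int* col_idxs, const double* values,
                            const double* x, double* y)
{
    for (int row = threadIdx.x; row < num_rows; row += blockDim.x) {
        double acc = 0.0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            acc += values[k] * x[col_idxs[k]];
        }
        y[row] = acc;
    }
}

// Grid-stride loop over the batch, mirroring the kernels above: block
// blockIdx.x handles entries ibatch, ibatch + gridDim.x, ... and no
// cross-entry synchronization is ever needed.
__global__ void batch_spmv(const batch_csr_view a, const double* b, double* c)
{
    for (int ibatch = blockIdx.x; ibatch < a.num_batch; ibatch += gridDim.x) {
        single_spmv(a.num_rows, a.row_ptrs, a.col_idxs,
                    a.values + ibatch * a.num_nnz,
                    b + ibatch * a.num_rows, c + ibatch * a.num_rows);
    }
}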
common/cuda_hip/matrix/batch_dense_kernels.hpp.inc: 18 changes (9 additions & 9 deletions)

@@ -119,12 +119,12 @@ __device__ __forceinline__ void single_advanced_matvec_kernel(

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size, sm_multiplier) void advanced_mv(
-    const gko::batch_dense::UniformBatch<const ValueType> alpha,
-    const gko::batch_dense::UniformBatch<const ValueType> a,
-    const gko::batch_dense::UniformBatch<const ValueType> b,
-    const gko::batch_dense::UniformBatch<const ValueType> beta,
-    const gko::batch_dense::UniformBatch<ValueType> c)
+__launch_bounds__(default_block_size, sm_multiplier) void advanced_mv(
+    const gko::batch_dense::UniformBatch<const ValueType> alpha,
+    const gko::batch_dense::UniformBatch<const ValueType> a,
+    const gko::batch_dense::UniformBatch<const ValueType> b,
+    const gko::batch_dense::UniformBatch<const ValueType> beta,
+    const gko::batch_dense::UniformBatch<ValueType> c)
 {
     for (size_type ibatch = blockIdx.x; ibatch < a.num_batch;
          ibatch += gridDim.x) {
@@ -308,9 +308,9 @@ __device__ __forceinline__ void compute_norm2(

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size, sm_multiplier) void compute_norm2(
-    const gko::batch_dense::UniformBatch<const ValueType> x,
-    const gko::batch_dense::UniformBatch<remove_complex<ValueType>> result)
+__launch_bounds__(default_block_size, sm_multiplier) void compute_norm2(
+    const gko::batch_dense::UniformBatch<const ValueType> x,
+    const gko::batch_dense::UniformBatch<remove_complex<ValueType>> result)
 {
     for (size_type ibatch = blockIdx.x; ibatch < x.num_batch;
          ibatch += gridDim.x) {
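
Reductions such as compute_norm2 use the same one-block-per-entry mapping with a block-local reduction on top. Below is a sketch of the per-entry 2-norm under the same simplifications as before (real double values, power-of-two block size, a plain shared-memory tree reduction instead of Ginkgo's reduction helpers):

// One block reduces one batch entry at a time; launch as, e.g.,
// batch_norm2<256><<<num_blocks, 256>>>(nbatch, nrows, x, result).
template <int block_size>
__global__ void batch_norm2(const int num_batch, const int num_rows,
                            const double* x, double* result)
{
    __shared__ double partial[block_size];
    for (int ibatch = blockIdx.x; ibatch < num_batch; ibatch += gridDim.x) {
        const double* xb = x + ibatch * num_rows;
        double acc = 0.0;
        for (int i = threadIdx.x; i < num_rows; i += blockDim.x) {
            acc += xb[i] * xb[i];  // per-thread partial sum of squares
        }
        partial[threadIdx.x] = acc;
        __syncthreads();
        // Tree reduction in shared memory (block_size is a power of two).
        for (int offset = block_size / 2; offset > 0; offset /= 2) {
            if (threadIdx.x < offset) {
                partial[threadIdx.x] += partial[threadIdx.x + offset];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            result[ibatch] = sqrt(partial[0]);
        }
        __syncthreads();  // partial[] is reused for the next entry
    }
}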
common/cuda_hip/matrix/batch_ell_kernels.hpp.inc: 12 changes (6 additions & 6 deletions)

@@ -184,12 +184,12 @@ __global__ __launch_bounds__(default_block_size, sm_multiplier) void spmv(

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size, sm_multiplier) void advanced_spmv(
-    const batch_dense::UniformBatch<const ValueType> alpha,
-    const gko::batch_ell::UniformBatch<const ValueType> a,
-    const batch_dense::UniformBatch<const ValueType> b,
-    const batch_dense::UniformBatch<const ValueType> beta,
-    const batch_dense::UniformBatch<ValueType> c)
+__launch_bounds__(default_block_size, sm_multiplier) void advanced_spmv(
+    const batch_dense::UniformBatch<const ValueType> alpha,
+    const gko::batch_ell::UniformBatch<const ValueType> a,
+    const batch_dense::UniformBatch<const ValueType> b,
+    const batch_dense::UniformBatch<const ValueType> beta,
+    const batch_dense::UniformBatch<ValueType> c)
 {
     for (size_type ibatch = blockIdx.x; ibatch < a.num_batch;
          ibatch += gridDim.x) {
common/cuda_hip/preconditioner/batch_ilu_kernels.hpp.inc: 26 changes (13 additions & 13 deletions)

@@ -112,12 +112,12 @@ __device__ __forceinline__ void modify_rows_below_curr_row(

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void generate_exact_ilu0_kernel(
-    const size_type batch_size, const int nrows, const int nnz,
-    const int* const __restrict__ diag_ptrs,
-    const int* const __restrict__ mat_row_ptrs,
-    const int* const __restrict__ mat_col_idxs,
-    ValueType* const __restrict__ mat_values)
+__launch_bounds__(default_block_size) void generate_exact_ilu0_kernel(
+    const size_type batch_size, const int nrows, const int nnz,
+    const int* const __restrict__ diag_ptrs,
+    const int* const __restrict__ mat_row_ptrs,
+    const int* const __restrict__ mat_col_idxs,
+    ValueType* const __restrict__ mat_values)
 {
     for (size_type batch_id = blockIdx.x; batch_id < batch_size;
          batch_id += gridDim.x) {
@@ -211,13 +211,13 @@ __global__ __launch_bounds__(default_block_size) void generate_parilu0_kernel(
 }

 __global__
-__launch_bounds__(default_block_size) void generate_common_pattern_to_fill_L_and_U(
-    const int nrows, const int* const __restrict__ row_ptrs,
-    const int* const __restrict__ col_idxs,
-    const int* const __restrict__ row_ptrs_L,
-    const int* const __restrict__ row_ptrs_U,
-    int* const __restrict__ L_col_holders,
-    int* const __restrict__ U_col_holders)
+__launch_bounds__(default_block_size) void generate_common_pattern_to_fill_L_and_U(
+    const int nrows, const int* const __restrict__ row_ptrs,
+    const int* const __restrict__ col_idxs,
+    const int* const __restrict__ row_ptrs_L,
+    const int* const __restrict__ row_ptrs_U,
+    int* const __restrict__ L_col_holders,
+    int* const __restrict__ U_col_holders)
 {
     constexpr int warp_size = config::warp_size;
     auto tile_grp =
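
The generate_exact_ilu0_kernel above factorizes every batch entry in place on the shared sparsity pattern, with diag_ptrs giving the position of each row's diagonal entry. For reference, this is what each entry's factorization computes, written as a sequential host-side sketch of textbook ILU(0) on one CSR matrix with sorted column indices; this shows the algorithm only, not the batched device code:

// In-place ILU(0): the unit lower factor L and the upper factor U overwrite
// A, restricted to A's own sparsity pattern (no fill-in).
void ilu0_factorize(const int nrows, const int* row_ptrs, const int* col_idxs,
                    const int* diag_ptrs, double* values)
{
    for (int i = 0; i < nrows; ++i) {
        // Eliminate the entries of row i left of the diagonal.
        for (int k = row_ptrs[i]; k < diag_ptrs[i]; ++k) {
            const int col = col_idxs[k];
            values[k] /= values[diag_ptrs[col]];  // L(i, col)
            // Update the rest of row i against the upper part of row `col`,
            // but only where row i already has an entry.
            for (int j = k + 1; j < row_ptrs[i + 1]; ++j) {
                for (int m = diag_ptrs[col] + 1; m < row_ptrs[col + 1]; ++m) {
                    if (col_idxs[m] == col_idxs[j]) {
                        values[j] -= values[k] * values[m];
                        break;
                    }
                }
            }
        }
    }
}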
common/cuda_hip/preconditioner/batch_isai_kernels.hpp.inc: 77 changes (37 additions & 40 deletions)

@@ -32,13 +32,13 @@

 template <int subwarp_size>
 __global__
-__launch_bounds__(default_block_size) void extract_dense_linear_sys_pattern_kernel(
-    const int nrows, const int* const __restrict__ A_row_ptrs,
-    const int* const __restrict__ A_col_idxs,
-    const int* const __restrict__ aiA_row_ptrs,
-    const int* const __restrict__ aiA_col_idxs,
-    int* const dense_mat_pattern, int* const rhs_one_idxs, int* const sizes,
-    int* num_matches_per_row_for_each_csr_sys)
+__launch_bounds__(default_block_size) void extract_dense_linear_sys_pattern_kernel(
+    const int nrows, const int* const __restrict__ A_row_ptrs,
+    const int* const __restrict__ A_col_idxs,
+    const int* const __restrict__ aiA_row_ptrs,
+    const int* const __restrict__ aiA_col_idxs, int* const dense_mat_pattern,
+    int* const rhs_one_idxs, int* const sizes,
+    int* num_matches_per_row_for_each_csr_sys)
 {
     using gko::preconditioner::batch_isai::row_size_limit;
     // assert(subwarp_size >= row_size_limit); //Not required here
@@ -312,16 +312,15 @@

 template <int subwarp_size, typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void fill_values_dense_mat_and_solve_kernel(
-    const int nbatch, const int nrows, const int A_nnz,
-    const ValueType* const A_values, const int aiA_nnz,
-    const int* const __restrict__ aiA_row_ptrs,
-    ValueType* const __restrict__ aiA_values,
-    const int* const __restrict__ dense_mat_pattern,
-    const int* const __restrict__ rhs_one_idxs,
-    const int* const __restrict__ sizes,
-    const enum gko::preconditioner::batch_isai_input_matrix_type
-        matrix_type)
+__launch_bounds__(default_block_size) void fill_values_dense_mat_and_solve_kernel(
+    const int nbatch, const int nrows, const int A_nnz,
+    const ValueType* const A_values, const int aiA_nnz,
+    const int* const __restrict__ aiA_row_ptrs,
+    ValueType* const __restrict__ aiA_values,
+    const int* const __restrict__ dense_mat_pattern,
+    const int* const __restrict__ rhs_one_idxs,
+    const int* const __restrict__ sizes,
+    const enum gko::preconditioner::batch_isai_input_matrix_type matrix_type)
 {
     using gko::preconditioner::batch_isai::row_size_limit;
     static_assert(row_size_limit <= subwarp_size, "incompatible subwarp size");
@@ -374,13 +373,13 @@

     if (matrix_type == gko::preconditioner::batch_isai_input_matrix_type::
             lower_tri)  // input matrix: lower_tri =>
                         // transposed system: upper_tri
     {
         sol = solve_upper_tri_dense_system(subwarpgrp, size, local_row,
                                            rhs_one_idx);
     } else if (matrix_type ==
                gko::preconditioner::batch_isai_input_matrix_type::
                    upper_tri)  // input matrix: upper_tri => transposed
                                // system: lower_tri
     {
         sol = solve_lower_tri_dense_system(subwarpgrp, size, local_row,
@@ -436,16 +435,16 @@

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void extract_csr_sys_pattern_kernel(
-    const int lin_sys_row, const int* const __restrict__ inv_row_ptrs,
-    const int* const __restrict__ inv_col_idxs,
-    const int* const __restrict__ sys_row_ptrs,
-    const int* const __restrict__ sys_col_idxs,
-    const int* const __restrict__ csr_pattern_row_ptrs,
-    int* const __restrict__ csr_pattern_col_idxs,
-    gko::remove_complex<ValueType>* const __restrict__ csr_pattern_values)
+__launch_bounds__(default_block_size) void extract_csr_sys_pattern_kernel(
+    const int lin_sys_row, const int* const __restrict__ inv_row_ptrs,
+    const int* const __restrict__ inv_col_idxs,
+    const int* const __restrict__ sys_row_ptrs,
+    const int* const __restrict__ sys_col_idxs,
+    const int* const __restrict__ csr_pattern_row_ptrs,
+    int* const __restrict__ csr_pattern_col_idxs,
+    gko::remove_complex<ValueType>* const __restrict__ csr_pattern_values)
 {
     // use one thread per match of the 2 arrays (non-coalesced accesses but
     // data locality)
     const int gid = threadIdx.x + blockIdx.x * blockDim.x;

@@ -470,12 +469,11 @@

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void fill_batch_csr_system_kernel(
-    const int nbatch, const int csr_nnz,
-    const gko::remove_complex<
-        ValueType>* const __restrict__ csr_pattern_values,
-    const int sys_nnz, const ValueType* const __restrict__ sys_csr_values,
-    ValueType* const __restrict__ batch_csr_mats_values)
+__launch_bounds__(default_block_size) void fill_batch_csr_system_kernel(
+    const int nbatch, const int csr_nnz,
+    const gko::remove_complex<ValueType>* const __restrict__ csr_pattern_values,
+    const int sys_nnz, const ValueType* const __restrict__ sys_csr_values,
+    ValueType* const __restrict__ batch_csr_mats_values)
 {
     const int gid = threadIdx.x + blockDim.x * blockIdx.x;

@@ -494,10 +492,9 @@

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void initialize_b_and_x_vectors_kernel(
-    const int nbatch, const int size, const int rhs_one_idx,
-    ValueType* const __restrict__ b_vals,
-    ValueType* const __restrict__ x_vals)
+__launch_bounds__(default_block_size) void initialize_b_and_x_vectors_kernel(
+    const int nbatch, const int size, const int rhs_one_idx,
+    ValueType* const __restrict__ b_vals, ValueType* const __restrict__ x_vals)
 {
     const int gid = threadIdx.x + blockDim.x * blockIdx.x;

@@ -515,11 +512,11 @@

 template <typename ValueType>
 __global__
-__launch_bounds__(default_block_size) void write_large_sys_solution_to_inverse_kernel(
-    const int nbatch, const int lin_sys_row, const int size,
-    const ValueType* const __restrict__ x_vals, const int inv_nnz,
-    const int* const __restrict__ inv_row_ptrs,
-    ValueType* const __restrict__ inv_vals)
+__launch_bounds__(default_block_size) void write_large_sys_solution_to_inverse_kernel(
+    const int nbatch, const int lin_sys_row, const int size,
+    const ValueType* const __restrict__ x_vals, const int inv_nnz,
+    const int* const __restrict__ inv_row_ptrs,
+    ValueType* const __restrict__ inv_vals)
 {
     const int gid = threadIdx.x + blockDim.x * blockIdx.x;
     assert(size == inv_row_ptrs[lin_sys_row + 1] - inv_row_ptrs[lin_sys_row]);
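
Taken together, the kernels in this file implement the usual incomplete sparse approximate inverse construction: for each row i of the approximate inverse M, extract the dense submatrix of A restricted to that row's sparsity pattern J, then solve a small system whose right-hand side is a unit vector (rhs_one_idx marks where the 1 sits inside J). Row i of M must satisfy m^T A(J, J) = e^T, so the system actually solved is A(J, J)^T m = e; this is why a lower-triangular input yields an upper-triangular solve, as the comments above note. A dense host-side sketch of that per-row step, illustrative only (the kernels above solve these small systems per subwarp on the device):

#include <vector>

// Solve A(J, J)^T m = e for one row of the approximate inverse, where a_jj
// is the dense extraction of A on the row's pattern J. A is assumed lower
// triangular, so A^T is upper triangular: plain backward substitution.
std::vector<double> isai_row(const std::vector<std::vector<double>>& a_jj,
                             const int rhs_one_idx)
{
    const int n = static_cast<int>(a_jj.size());
    std::vector<double> m(n, 0.0);
    for (int k = n - 1; k >= 0; --k) {
        double rhs = (k == rhs_one_idx) ? 1.0 : 0.0;
        for (int j = k + 1; j < n; ++j) {
            rhs -= a_jj[j][k] * m[j];  // A^T(k, j) == A(j, k)
        }
        m[k] = rhs / a_jj[k][k];
    }
    return m;
}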