Added ParILU for CUDA #324

Merged: 12 commits merged into develop from parilu_cuda on Jul 12, 2019
Conversation

@thoasm (Member) commented Jul 5, 2019:

This PR adds the necessary kernels for the CUDA executor to use ParIlu.

I set the default number of iterations to 5, although I am not sure that is actually the best value.

As part of this PR, the test matrices were moved into a separate folder, with a header file describing their location. This way, the same matrix does not need to be replicated for multiple tests.

Note:

  • CSR matrix sorting is not part of this PR and is currently not supported on the CUDA executor!
  • The CUDA implementation is tested with the exact same matrix as the OpenMP version, which we might change.
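
As a usage sketch (not part of this PR's diff): with these kernels in place, a ParILU factorization can be generated on the CUDA executor roughly as follows. This is a minimal sketch assuming Ginkgo's public API around this release; treat the exact parameter name (`with_iterations`) and the factor accessors as assumptions.

```cpp
#include <iostream>

#include <ginkgo/ginkgo.hpp>

int main()
{
    // Run the ParILU kernels on CUDA device 0, with OpenMP as the host side.
    auto exec = gko::CudaExecutor::create(0, gko::OmpExecutor::create());

    // Read the system matrix in CSR format. Note: this PR does not add CSR
    // sorting for CUDA, so the input is assumed to be sorted already.
    auto A = gko::share(
        gko::read<gko::matrix::Csr<double, int>>(std::cin, exec));

    // Build the factorization with the default of 5 iterations noted above.
    auto factors = gko::factorization::ParIlu<double, int>::build()
                       .with_iterations(5u)
                       .on(exec)
                       ->generate(A);

    // factors->get_l_factor() and factors->get_u_factor() expose L and U.
}
```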

@thoasm thoasm added mod:cuda This is related to the CUDA module. type:preconditioner This is related to the preconditioners 1:ST:ready-for-review This PR is ready for review labels Jul 5, 2019
@thoasm thoasm self-assigned this Jul 5, 2019
@thoasm (Member, Author) commented Jul 7, 2019:

I found out why the tests for the CUDA kernels performed so slowly:
Printing the reference matrix, the CUDA matrix, and their difference took forever. The kernels themselves were completely fine.

We now use ani4 instead of ani5 as the test matrix (ani4 still has 3081 rows and columns, so it should be fine), which makes the test faster overall (6572 ms on my system).
Additionally, I increased the iteration count for CUDA (both the default and the one in the test case where we expect the result to be almost exactly the ILU), so the tests do not fail.

We can also use the same matrix (ani4) for the OpenMP test, which would increase its runtime from originally 28 ms to about 4000 ms (and requires raising the default iteration count from 3 to 4 on my system to pass).

The question is: what do we consider an acceptable deviation, and therefore, what do we set as the default iteration count?
Currently, we aim for 5e-3, but I set that arbitrarily so we could pass ani1 fairly closely with 3 iterations.

codecov bot commented Jul 7, 2019:

Codecov Report

Merging #324 into develop will decrease coverage by 0.01%.
The diff coverage is 95.65%.


@@             Coverage Diff             @@
##           develop     #324      +/-   ##
===========================================
- Coverage    98.25%   98.24%   -0.02%     
===========================================
  Files          224      225       +1     
  Lines        17458    17561     +103     
===========================================
+ Hits         17153    17252      +99     
- Misses         305      309       +4
Impacted Files                                  Coverage Δ
include/ginkgo/core/base/math.hpp               93.1% <0%> (-6.9%) ⬇️
reference/factorization/par_ilu_kernels.cpp     100% <100%> (ø) ⬆️
omp/factorization/par_ilu_kernels.cpp           100% <100%> (ø) ⬆️
omp/test/factorization/par_ilu_kernels.cpp      97.89% <75%> (ø) ⬆️
cuda/test/factorization/par_ilu_kernels.cpp     97.84% <97.84%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b43c985...664d978.

@thoasm thoasm added 1:ST:WIP This PR is a work in progress. Not ready for review. and removed 1:ST:ready-for-review This PR is ready for review labels Jul 8, 2019
@thoasm (Member, Author) commented Jul 8, 2019:

I added the infinity and NaN check that was requested, and lowered the iteration count for the CUDA compute kernel (while also lowering the expected test accuracy).
@hartwiganzt: Can you check whether that is what you expected, and whether I should also add this to the reference and OpenMP implementations?
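
For context, the guard discussed here follows the usual pattern for iterative factorizations: compute the candidate update for an entry and only write it back if it is finite, so a diverging sweep cannot overwrite the factors with NaN or inf. A minimal sketch of that pattern, not the PR's actual kernel code:

```cpp
#include <cmath>

// Hedged sketch of the NaN/inf guard: keep the previous value whenever the
// computed candidate is not finite, and let later sweeps correct it instead
// of poisoning the L/U factors.
template <typename ValueType>
void guarded_write(ValueType& entry, ValueType candidate)
{
    if (std::isfinite(candidate)) {
        entry = candidate;
    }
}
```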


compute_lu(&l_ref, &u_ref, &l_cuda, &u_cuda, iterations);

GKO_ASSERT_MTX_NEAR(l_ref, l_cuda, 1e-14);
Collaborator:
This likely won't be achieved.

Member Author:
I tested it, and it is achieved, because the test runs 200 iterations.


compute_lu(&l_ref, &u_ref, &l_cuda, &u_cuda);

GKO_ASSERT_MTX_NEAR(l_ref, l_cuda, 5e-2);
Collaborator:
Can't tell whether this is a good threshold - maybe check this.

Member Author:
I chose it because it worked on my system; I still have to check it on the CI system.

@hartwiganzt (Collaborator) commented:
Yes - this is the check I wanted. Not sure whether "is_inf_nan" or "is_nan_inf" is the better naming.
Also: did you happen to do a quick runtime comparison - what is the performance penalty we pay for this check?

@pratikvn previously approved these changes Jul 9, 2019
@pratikvn (Member) left a comment:
LGTM! I guess the default values for the number of iterations and the threshold still need to be finalized?

include/ginkgo/core/base/math.hpp (outdated, resolved)
@thoasm (Member, Author) commented Jul 9, 2019:

I did some checks on the CI system with ani5.mtx, but I don't see much of a difference. I am using 150 iterations and list multiple runs below (the given time is the total time for all 150 kernel calls):

with check: 8345204 ns  8320647 ns  8327829 ns
  no check: 8315675 ns  8335737 ns  8329883 ns

So we can definitely leave it in.

I will now do pretty much the same for OpenMP and Reference, while also removing the compile error we currently have for one specific compiler combination.
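
For reference, a timing like the one above can be gathered with a simple wall-clock loop around the kernel launches. A minimal sketch, with a hypothetical helper name (`time_kernel_ns`) that is not from the PR:

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

#include <ginkgo/ginkgo.hpp>

// Total wall-clock time in ns for `iterations` kernel calls. The
// synchronize() calls ensure that asynchronous CUDA launches are
// fully included in the measurement.
template <typename Kernel>
std::int64_t time_kernel_ns(std::shared_ptr<const gko::Executor> exec,
                            Kernel&& kernel, int iterations = 150)
{
    exec->synchronize();
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        kernel();
    }
    exec->synchronize();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```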

@hartwiganzt (Collaborator) commented:
What happens if you enter 100 iterations in the unit test? With the NaN/inf protection, it should still converge to the right solution (just not overwrite the results).

@thoasm (Member, Author) commented Jul 9, 2019:
Actually, that does not matter at all in our tests: in all the runs I did, the check never detected any NaN or inf values.
I might have to add a reference test to verify that the check actually does what it should, but for that I need the implementation first.

@thoasm (Member, Author) commented Jul 9, 2019:
@hartwiganzt I am not quite done with testing. The current version should work, but I still need to verify that it does.

@thoasm thoasm force-pushed the parilu_cuda branch 3 times, most recently from cb08a7b to ff7eab0 Compare July 11, 2019 13:33
@thoasm thoasm added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Jul 11, 2019
@thoasm (Member, Author) commented Jul 11, 2019:

Update: It finally builds on all machines (hopefully the tests are successful as well).
The current fix (especially the one for clang) is not ideal; however, I have no idea how to improve it.
The clang fix is the only one that might have a runtime impact, but I think it should be negligible.

This code is now ready for review.

@pratikvn (Member) left a comment:
LGTM!

include/ginkgo/core/base/math.hpp (outdated, resolved)
include/ginkgo/core/base/math.hpp (resolved)
@tcojean (Member) left a comment:
LGTM. I only have some minor comments.

include/ginkgo/core/base/math.hpp (outdated, resolved)
cuda/factorization/par_ilu_kernels.cu (outdated, resolved)
Thomas Grützmacher added 11 commits July 12, 2019 14:13. Commit message excerpts:

  • Separated the storage of the test matrices into a single location to prevent matrix file duplication between the CUDA and OpenMP tests.
  • Added more iterations for the default CUDA compute kernel for ParILU.
  • Reduced the default number of iterations for the CUDA compute kernel.
  • Raised the allowed tolerance for the OpenMP compute kernel, because the CI system fails otherwise when running on all cores.
  • Used the appropriate function for the NaN/inf check: `isfinite`.
@tcojean tcojean added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Jul 12, 2019
@thoasm thoasm merged commit 260f6ac into develop Jul 12, 2019
@thoasm thoasm deleted the parilu_cuda branch August 5, 2019 10:14
tcojean added a commit that referenced this pull request Oct 21, 2019
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.1.0. This release brings several performance improvements, adds Windows support,
adds support for factorizations inside Ginkgo, and adds a new ILU preconditioner
based on the ParILU algorithm, among other things. For detailed information, check the respective issues.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


The current known issues can be found in the [known issues
page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


### Additions
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) 
+ New factorization support in Ginkgo, and addition of the ParILU
  algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual Studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](#312))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column indices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar *
  basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP
  kernels ([#349](#349), [#350](#350))

### Fixes
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

### Tools and ecosystem improvements
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments
  to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))


Related PR: #370
Labels: 1:ST:ready-to-merge This PR is ready to merge. mod:cuda This is related to the CUDA module. type:preconditioner This is related to the preconditioners
4 participants