Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve update hessenberg of gmres #363

Merged
merged 8 commits into from
Oct 15, 2019
Merged

Conversation

yhmtsai
Copy link
Member

@yhmtsai yhmtsai commented Oct 14, 2019

This PR improves the kernel update_hessenberg and remove the unneeded copy when preconditioner is identity.
It closes #364

This part focus on the update_hessenberg improvement.
Take step 1 of gmres as the target.
It uses our implementation not cublas according to profiling.

Profiling results of step 1 in gmres
The left column of table is the number of right hand side.

P100

parabolic_fem cublas implementation
4 28,874,986,570 18,591,212,314
2 17,228,621,978 16,034,896,559
1 11,732,775,039 14,406,294,158
cz20468 cublas implementation
4 3,265,137,834 1,057,310,522
2 1,950,870,235 978,492,002
1 1,324,164,163 887,645,474

V100

parabolic_fem cublas implementation
4 27,439,138,382 12,768,940,580
2 16,620,131,622 12,468,852,824
1 10,211,400,108 10,739,503,389
cz20468 cublas implementation
4 2,037,349,176 737,828,231
2 1,222,426,279 647,972,789
1 800,424,903 608,971,884

edited: cherry-pick @thoasm branch fix_copy_gmres

@yhmtsai yhmtsai added mod:cuda This is related to the CUDA module. type:solver This is related to the solvers 1:ST:ready-for-review This PR is ready for review labels Oct 14, 2019
@yhmtsai yhmtsai self-assigned this Oct 14, 2019
@tcojean tcojean added this to the v1.1.0 milestone Oct 14, 2019
@tcojean
Copy link
Member

tcojean commented Oct 14, 2019

This doesn't include @thoasm's fix, right?

@yhmtsai
Copy link
Member Author

yhmtsai commented Oct 14, 2019

@tcojean It does not include @thoasm's fix.

@tcojean
Copy link
Member

tcojean commented Oct 14, 2019

OK, since he is not available right now, maybe you should include it. You had access to his code I think, right?

Edit: actually I think it's in a branch already pushed, we could also try to create a PR from that branch.

@thoasm
Copy link
Member

thoasm commented Oct 14, 2019

@yhmtsai It would be nice if you could add it to this PR.
I kind of expected our two versions to be merged into one PR since they both aim to improve the GMRES performance. The code is in the branch fix_copy_gmres, you should be able to just rebase on top of it to merge both (or use cherry-pick).

@thoasm
Copy link
Member

thoasm commented Oct 14, 2019

@yhmtsai Also, I have one question about the left column in your benchmark tables:
Do they show the number of used right hand sides?

Either way, the results look impressive.

@yhmtsai
Copy link
Member Author

yhmtsai commented Oct 14, 2019

@thoasm Sure, I use cherry-pick your branch.
This PR is full improvement of #364 now.
Yes, the value is the number of right hand side.

@codecov
Copy link

codecov bot commented Oct 14, 2019

Codecov Report

Merging #363 into develop will decrease coverage by 0.04%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #363      +/-   ##
===========================================
- Coverage    90.02%   89.98%   -0.05%     
===========================================
  Files          136      136              
  Lines         6075     6078       +3     
===========================================
  Hits          5469     5469              
- Misses         606      609       +3
Impacted Files Coverage Δ
core/solver/gmres.cpp 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad6f41d...e328445. Read the comment docs.

@yhmtsai
Copy link
Member Author

yhmtsai commented Oct 14, 2019

This does not include residual computation in each loop because the computation has big impact on the detail performance.

cz20468 - nrhs 1 develop this PR
allocate 599,799,331 945,154
copy 76,167,693 455,443
free 309,834,097 884,189
step_1 1,312,517,216 780,350,410
total time 2,363,942,119 807,847,639
cz20468 - nrhs 4 develop this PR
allocate 1,412,737,360 2,513,371
copy 201,166,396 465,334
free 401,083,313 1,454,055
step_1 1,842,918,364 830,139,299
total time 3,864,238,217 872,799,092
parabolic_fem - nrhs 1 develop this PR
allocate 7,028,298,122 11,236,184
copy 1,102,546,375 800,781
free 1,038,917,547 3,245,609
step_1 26,548,088,560 9,208,901,557
total time 36,311,182,917 9,836,719,509
parabolic_fem - nrhs 4 develop this PR
allocate 26,930,615,289 35,861,530
copy 4,318,032,166 1,378,129
free 3,238,371,633 5,515,734
step_1 40,431,872,889 10,573,231,894
total time 77,478,897,484 13,275,676,156

Copy link
Member

@tcojean tcojean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The performance results look really good now!

@hartwiganzt
Copy link
Collaborator

I guess we can go ahead and merge this?

@yhmtsai yhmtsai removed the 1:ST:ready-for-review This PR is ready for review label Oct 15, 2019
@yhmtsai yhmtsai added the 1:ST:ready-to-merge This PR is ready to merge. label Oct 15, 2019
@yhmtsai yhmtsai merged commit da35c73 into develop Oct 15, 2019
@pratikvn pratikvn deleted the improve_update_hessenberg branch October 20, 2019 11:39
tcojean added a commit that referenced this pull request Oct 20, 2019
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.1.0. This release brings several performance improvements, adds Windows support, 
adds support for factorizations inside Ginkgo and a new ILU preconditioner
based on ParILU algorithm, among other things. For detailed information, check the respective issue.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and CygWin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or CygWin.


The current known issues can be found in the [known issues
page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


Additions:
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) 
+ New factorization support in Ginkgo, and addition of the ParILU
  algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](#312))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column idices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar *
  basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP
  kernels ([#349](#349), [#350](#350))

Fixes:
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

Tools and ecosystem:
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments
  to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))
tcojean added a commit that referenced this pull request Oct 21, 2019
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.1.0. This release brings several performance improvements, adds Windows support,
adds support for factorizations inside Ginkgo and a new ILU preconditioner
based on ParILU algorithm, among other things. For detailed information, check the respective issue.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


The current known issues can be found in the [known issues
page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


### Additions
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) 
+ New factorization support in Ginkgo, and addition of the ParILU
  algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual Studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](#312))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column idices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar *
  basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP
  kernels ([#349](#349), [#350](#350))

### Fixes
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

### Tools and ecosystem improvements
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments
  to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))


Related PR: #370
@yhmtsai yhmtsai mentioned this pull request May 17, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-to-merge This PR is ready to merge. mod:cuda This is related to the CUDA module. type:solver This is related to the solvers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GMRES time improvement placeholder
5 participants