Make ICOPY_OPERATION threaded? #4834

Open
ChipKerchner opened this issue Aug 1, 2024 · 1 comment

ChipKerchner (Contributor) commented Aug 1, 2024

In some multi-threaded versions of OpenBLAS, ICOPY_OPERATION (the packing of A) seems to take 6 to 13 times as long as the actual KERNEL_OPERATION calls. Is it possible to make this operation multi-threaded?

    /* Copy local region of A into workspace */
    START_RPCC();
    ICOPY_OPERATION(min_l, min_i, a, lda, ls, m_from, sa);
    STOP_RPCC(copy_A);

    48.09%  pt_main_thread  libomp.so                    [.] bool __kmp_wait_template<kmp_flag_64<false, true>, true, false, true>(kmp_info*, kmp_flag_64<false, true>*, void*)
    18.23%  pt_main_thread  libgomp.so.1.0.0             [.] do_wait
     9.01%  pt_main_thread  libopenblasp-r0.3.27.dev.so  [.] sbgemm_incopy_POWER10
     2.85%  pt_main_thread  libopenblasp-r0.3.27.dev.so  [.] sbgemm_kernel_POWER10
     1.33%  pt_main_thread  libopenblasp-r0.3.27.dev.so  [.] inner_thread

martin-frbg (Collaborator) commented

I don't think anybody has tried yet, but there is probably no fundamental argument against doing it (apart from the fact that the copy operation itself is already called from a multithreaded region). The least invasive way of trying would probably be to add the blas_level1_thread() mechanism to an existing ICOPY kernel (similar to how some of the DOT kernels are parallelized on arm64 and x86_64 targets). But I must admit I have not really thought this through; this is just a first impression. I think we already have an open issue about low performance of ICOPY, though that may have been more concerned with the use of outdated instructions on modern targets.
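
To make the idea above concrete, here is a minimal sketch of splitting a packing copy across threads by row range. It is not OpenBLAS code and does not call blas_level1_thread(): the names pack_job, pack_rows, pack_a_threaded and MAX_PACK_THREADS are hypothetical, the packed layout (k contiguous floats per row) is a simplification of what a real ICOPY kernel produces, and plain pthreads stand in for the library's own thread pool. It only illustrates that disjoint row ranges can be packed independently because each worker writes its own part of the workspace.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PACK_THREADS 16   /* hypothetical upper bound for this sketch */

/* One unit of work: pack rows [row_from, row_to) of a column-major matrix. */
typedef struct {
    const float *a;    /* source matrix, column-major */
    size_t lda;        /* leading dimension of a */
    size_t k;          /* number of columns to pack */
    size_t row_from;   /* first row handled by this worker */
    size_t row_to;     /* one past the last row handled by this worker */
    float *dst;        /* packed workspace, k contiguous floats per row */
} pack_job;

/* Simplified stand-in for an ICOPY kernel (assumed layout, not the real one). */
static void pack_rows(const pack_job *job) {
    for (size_t i = job->row_from; i < job->row_to; i++)
        for (size_t j = 0; j < job->k; j++)
            job->dst[i * job->k + j] = job->a[i + j * job->lda];
}

static void *pack_worker(void *arg) {
    pack_rows((const pack_job *)arg);
    return NULL;
}

/* Pack an m x k block of a into dst using up to nthreads workers. */
void pack_a_threaded(const float *a, size_t lda, size_t m, size_t k,
                     float *dst, int nthreads) {
    assert(nthreads > 0 && nthreads <= MAX_PACK_THREADS);

    pthread_t tid[MAX_PACK_THREADS];
    pack_job job[MAX_PACK_THREADS];
    int created[MAX_PACK_THREADS] = {0};
    int njobs = 0;

    /* Rows per worker, rounded up so every row is covered. */
    size_t chunk = (m + (size_t)nthreads - 1) / (size_t)nthreads;

    for (int t = 0; t < nthreads; t++) {
        size_t from = (size_t)t * chunk;
        if (from >= m) break;                       /* no rows left */
        size_t to = (from + chunk < m) ? from + chunk : m;

        job[t] = (pack_job){a, lda, k, from, to, dst};
        njobs = t + 1;

        if (pthread_create(&tid[t], NULL, pack_worker, &job[t]) == 0)
            created[t] = 1;
        else
            pack_rows(&job[t]);   /* fall back to the calling thread */
    }

    /* Wait for all workers before the packed buffer is consumed. */
    for (int t = 0; t < njobs; t++)
        if (created[t]) pthread_join(tid[t], NULL);
}
```

As noted in the comment above, the copy is already called from a multithreaded region, so spawning additional workers at this point could oversubscribe the cores. Whether a split like this helps in practice would have to be measured.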
