Faster gaussian blur for OpenCL and CPU #17572

Closed


jenshannoschwalm
Collaborator

@jenshannoschwalm jenshannoschwalm commented Oct 1, 2024

Implement dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur()

Although the standard gaussian blur code for cl buffers or CPU is pretty fast, we can do even better for small sigmas.

Using a simple, small NxN kernel is faster because we don't have to fiddle around with sorting data. Especially on cards with lots of processing units and fast cl_mem, or on CPUs with cached memory, the algorithm is clearly faster due to cache locality. (A note: Ingo Weyrich, aka heckflosse, did a lot of testing for CPUs a while ago when developing capture sharpening. He found a benefit for kernels up to 13x13.)

Both fast gaussian variants (OpenCL and CPU) support 1, 2 or 4 channels and can run on a 5x5, 7x7 or 9x9 coeff matrix. The chosen matrix size depends on the sigma: 9x9 for sigma > 1.2, 7x7 for sigma > 0.7, or 5x5 if smaller than 0.7.
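A minimal sketch of the size selection described above. The helper name `fast_kernel_size` is hypothetical and not taken from the PR; only the two thresholds come from the description. The text does not say what happens at exactly 0.7, so mapping that case to 5x5 is an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not part of darktable: returns the width of the
   square coeff matrix (5, 7 or 9) for a given sigma, using the thresholds
   quoted in the PR description. sigma == 0.7 falls into the 5x5 case here,
   which is an assumption because the description leaves it open. */
static int fast_kernel_size(const float sigma)
{
  assert(sigma > 0.0f);
  if(sigma > 1.2f) return 9;
  if(sigma > 0.7f) return 7;
  return 5;
}
```

For example, `fast_kernel_size(1.0f)` selects the 7x7 matrix and `fast_kernel_size(1.5f)` the 9x9 one.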

The measured performance gain depends on the chosen sigma. Performance relative to the "standard" gaussian is approximately as follows, for both the OpenCL and the CPU code:

  • 9x9 200%
  • 7x7 300%
  • 5x5 400%

The calculated coeffs and the kernel leave out edge positions, as they contribute almost nothing to the correct result. The error vs the standard gaussian is less than 0.2% for each kernel.

Please note: both functions now also do a proper gaussian at the borders (they did not before this PR!).
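One common way to get a well-defined blur at the borders is to clamp the sample coordinates to the image, so that border pixels reuse the nearest valid pixel. The sketch below shows that idea for a single channel. It is only an assumption about what "proper gaussian at the borders" could look like: the PR does not describe the border strategy, and the names `clampi` and `blur_1ch_clamped` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static inline int clampi(const int v, const int lo, const int hi)
{
  return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical sketch, not the PR's code: single-channel blur with a
   (2r+1) x (2r+1) coeff table stored row by row. Out-of-image taps are
   clamped to the nearest valid pixel (edge replication), which is an
   assumed border strategy. Not in-place: in and out must differ. */
static void blur_1ch_clamped(const float *in, float *out,
                             const int width, const int height,
                             const float *coeffs, const int r)
{
  assert(in != out && width > 0 && height > 0 && r >= 0);
  const int size = 2 * r + 1;
  for(int row = 0; row < height; row++)
    for(int col = 0; col < width; col++)
    {
      float sum = 0.0f;
      for(int dy = -r; dy <= r; dy++)
      {
        const int sy = clampi(row + dy, 0, height - 1);
        for(int dx = -r; dx <= r; dx++)
        {
          const int sx = clampi(col + dx, 0, width - 1);
          sum += coeffs[(dy + r) * size + (dx + r)] * in[(size_t)sy * width + sx];
        }
      }
      out[(size_t)row * width + col] = sum;
    }
}
```

With normalized coeffs, a constant image stays constant everywhere, including the border rows and columns, which is a quick sanity check for any border handling.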

Color equalizer gets a performance boost

When using a small sigma for gaussian blurring in the guided filter, we can use the faster dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants.

As gaussian blurring is a major performance bottleneck in color equalizer, the overall performance is almost doubled if the radii for the guided filter are small, as with the defaults.

Gaussian blurring maintenance

For details threshold, dual demosaicing, mask blurring and segmentation gradients we used special gaussian blurs.

All of these algorithms now use the public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions; the internal code has been removed.

@jenshannoschwalm jenshannoschwalm added scope: image processing correcting pixels scope: performance doing everything the same but faster scope: codebase making darktable source code easier to manage OpenCL Related to darktable OpenCL code labels Oct 1, 2024
@jenshannoschwalm jenshannoschwalm added this to the 5.0 milestone Oct 1, 2024
@jenshannoschwalm
Collaborator Author

@TurboGit the maths has not changed at all and the gaussian kernels were there before; this is just some editing.
Please note the last comment about the fast gaussians: we now calculate the borders correctly, so differences at the borders are expected for some integration tests (dual demosaicing, possibly segmentation, details threshold).

@TurboGit
Member

TurboGit commented Oct 1, 2024

Sadly this SIGSEGVs on the colorequal guided-filter test:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
./run.sh: line 56: 520279 Segmentation fault      $* > /dev/null 2> /dev/null
========== COMMAND fails
To reproduce and debug:
cd /home/obry/dev/git/darktable/src/tests/integration/0159-coloreq-guided-filter; gdb --args /opt/darktable/bin/darktable-cli --width 2048 --height 2048 --hq true --apply-custom-presets false /home/obry/dev/git/darktable/src/tests/integration/images/mire1.cr2 coloreq-guided-filter.xmp output.png --core --disable-opencl --conf host_memory_limit=8192 --conf resourcelevel=reference --conf worker_threads=4 -t 4 --conf plugins/lighttable/export/force_lcms2=FALSE --conf plugins/lighttable/export/iccintent=0

The backtrace:

(gdb) bt
#0  0x00007ffff792cf34 in _fast_9x9_kernel_2._omp_fn.0 ()
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:646
#1  0x00007ffff7e0a866 in GOMP_parallel ()
    at /lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007ffff792e69e in _fast_9x9_kernel_2
    (input=0x7ffea3200040, output=0x7ffe8fa00040, width=<optimized out>, height=<optimized out>, sigma=<optimized out>, min=<optimized out>, max=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:614
#3  dt_gaussian_fast_blur
    (in=0x7ffea3200040, out=0x7ffea3200040, width=<optimized out>, height=<optimized out>, sigma=<optimized out>, min=<optimized out>, max=<optimized out>, ch=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:812
#4  0x00007fffed90ff45 in _prefilter_chromaticity
    (UV=<optimized out>, saturation=<optimized out>, width=3908, height=2601, sigma=<optimized out>, eps=<optimized out>, sat_shift=0.100320004)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/iop/colorequal.c:672
#5  process
    (self=<optimized out>, piece=0x555557722a30, i=0x7fff24400040, o=0x7fff2e200040, roi_in=<optimized out>, roi_out=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/iop/colorequal.c:1074
#6  0x00007ffff7a2bb8d in _pixelpipe_process_on_CPU
    (pipe=<optimized out>, dev=0x7fffffff7870, input=<optimized out>, input_format=0x5555574b3c30, roi_in=<optimized out>, output=0x7ffffffef118, out_format=0x7ffffffef128, roi_out=0x7ffffffedf40, module=0x5555574e5770, piece=0x555557722a30, tiling=0x7ffffffedbc0, pixelpipe_flow=0x7ffffffedb50)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:1226
#7  0x00007ffff7a2ce1f in _dev_pixelpipe_process_rec
    (pipe=pipe@entry=0x7fffffff6e70, dev=dev@entry=0x7fffffff7870, output=output@entry=0x7ffffffef118, cl_mem_output=cl_mem_output@entry=0x7ffffffef120, out_format=out_format@entry=0x7ffffffef128, roi_out=roi_out@entry=0x7ffffffedf40, modules=0x5555575188e0 = {...}, pieces=0x555557724f80 = {...}, pos=51)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:2323
#8  0x00007ffff7a2c90b in _dev_pixelpipe_process_rec
    (pipe=pipe@entry=0x7fffffff6e70, dev=dev@entry=0x7fffffff7870, output=output@entry=0x7ffffffef118, cl_mem_output=cl_mem_output@entry=0x7ffffffef120, out_format=out_format@entry=0x7ffffffef128, roi_out=roi_out@entry=0x7ffffffee2e0, modules=0x555557518900 = {...}, pieces=0x555557725c00 = {...}, pos=52)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:1360

@TurboGit TurboGit self-requested a review October 1, 2024 15:16
Member

@TurboGit TurboGit left a comment

Crash on regression test 0159.

@jenshannoschwalm
Collaborator Author

Fixed a typo (in the relevant code, w3 was 8*width but must be 6*width)

Member

@TurboGit TurboGit left a comment

Something else is wrong: the math should not have changed, yet I see differences that are not on the border. See the huge diff:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
      CPU & GPU version differ by 642732 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.04421
      Avg dE                   : 0.10405
      Std dE                   : 0.20832
      ----------------------------------
      Pixels below avg + 0 std : 77.11 %
      Pixels below avg + 1 std : 81.07 %
      Pixels below avg + 3 std : 98.12 %
      Pixels below avg + 6 std : 99.98 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 6.26490
      Avg dE                   : 0.25191
      Std dE                   : 0.34982
      ----------------------------------
      Pixels below avg + 0 std : 59.02 %
      Pixels below avg + 1 std : 87.11 %
      Pixels below avg + 3 std : 98.44 %
      Pixels below avg + 6 std : 99.94 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.07 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.20198e+06 pixels changed)

The diff image for the CPU code path:

image

@jenshannoschwalm
Collaborator Author

Found wrong kernel parameter ...

Although the standard gaussian blur code for cl buffers or CPU is pretty fast we can do even better for small sigmas.

Using a simple NxN kernel is faster because we don't have to fiddle around with sorting data and especially on cards
with lots of processing units and fast cl_mem or on CPU cached memory the algorithm is clearly faster due to cache locality.

Both (OpenCL and CPU) fast gaussian variants support 1, 2 or 4 channels and can run on a 5x5, 7x7 or a 9x9 coeff matrix,
the chosen matrix size depends on the sigma. (9x9 for sigma > 1.2, 7x7 for sigma > 0.7 or 5x5 if smaller than 0.7)

Measured performance gain depends on the chosen sigma, performance vs "standard" is approximately both for OpenCL and
CPU code:
 - 9x9 200%
 - 7x7 300%
 - 5x5 400%

The calculated coeffs and the kernel leave out edge positions as they contribute almost nothing to the correct result,
the error vs the standard gaussian for each kernel is less than 0.2% as long as the sigma is <= 1.5

Please note: both functions now also do a proper gaussian at the borders (did not before this pr!).
While using small sigma for gaussian blurring we can use the faster
dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants.

As gaussian blurring is a major performance bottleneck in color equalizer, the overall
performance is almost doubled if the radii for the guided filter are small like with defaults.
For details threshold, dual demosaicing, mask blurring and segmentation gradients we used
special gaussian blurs.

All algorithms now make use of public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur()
functions, internal code has been removed.
@jenshannoschwalm
Collaborator Author

I think I found the culprit. As the code has a maximum kernel size of 9x9, we should only use the fast variants with sigmas up to 1.5. (2.0 would require a 13x13...)

Member

@TurboGit TurboGit left a comment

Still broken on my side:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
      CPU & GPU version differ by 642732 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.04421
      Avg dE                   : 0.10405
      Std dE                   : 0.20832
      ----------------------------------
      Pixels below avg + 0 std : 77.11 %
      Pixels below avg + 1 std : 81.07 %
      Pixels below avg + 3 std : 98.12 %
      Pixels below avg + 6 std : 99.98 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 6.26490
      Avg dE                   : 0.25191
      Std dE                   : 0.34982
      ----------------------------------
      Pixels below avg + 0 std : 59.02 %
      Pixels below avg + 1 std : 87.11 %
      Pixels below avg + 3 std : 98.44 %
      Pixels below avg + 6 std : 99.94 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.07 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.20198e+06 pixels changed)

@jenshannoschwalm
Collaborator Author

@TurboGit closing this for now until I fully understand the "issue".
