Faster gaussian blur for OpenCL and CPU #17572

jenshannoschwalm · 2024-10-01T05:20:49Z

Implement dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur()

Although the standard gaussian blur code for cl buffers or CPU is pretty fast, we can do even better for small sigmas.

Using a simple, small NxN kernel is faster because we don't have to fiddle around with sorting data. Especially on cards with lots of processing units and fast cl_mem, or on CPUs with cached memory, the algorithm is clearly faster due to cache locality. (A note: Ingo Weyrich - heckflosse - did a lot of testing for CPUs a while ago when developing capture sharpening. He found a benefit for kernels up to 13x13.)

Both fast gaussian variants (OpenCL and CPU) support 1, 2 or 4 channels and run on a 5x5, 7x7 or 9x9 coefficient matrix. The matrix size is chosen from the sigma: 9x9 for sigma > 1.2, 7x7 for sigma > 0.7, and 5x5 for smaller sigmas.
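The size selection described above can be sketched as a small helper. This is only an illustration: the function name `fast_gaussian_kernel_size` is hypothetical and not taken from the PR, and the handling of sigma exactly at a threshold (it falls to the smaller matrix) is an assumption.

```c
/* Hypothetical helper, not part of the PR: returns the side length N of the
   NxN coefficient matrix for a given sigma, using the thresholds quoted in
   the description. A sigma exactly on a threshold is assumed to fall to the
   smaller matrix. */
static int fast_gaussian_kernel_size(const float sigma)
{
  if(sigma > 1.2f) return 9;
  if(sigma > 0.7f) return 7;
  return 5;
}
```

For example, a sigma of 1.0 would select the 7x7 matrix and a sigma of 0.5 the 5x5 matrix.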

The measured performance gain depends on the chosen sigma. For both the OpenCL and the CPU code, performance relative to the "standard" gaussian is approximately:

9x9 200%
7x7 300%
5x5 400%

The calculated coeffs and the kernel leave out edge positions, as they contribute almost nothing to the correct result; the error vs the standard gaussian is less than 0.2% for each kernel.

Please note: both functions now also do a proper gaussian at the borders (they did not before this PR!).
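One common way to get a proper gaussian at the borders is to skip taps that fall outside the image and divide by the sum of the weights actually used. The sketch below shows that technique for a single channel; it is an assumption that the PR does something comparable, and `blur_renormalized` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical sketch: blur a single-channel float image with a
   (2r+1)x(2r+1) coefficient matrix. Taps outside the image are skipped and
   the accumulated value is divided by the sum of the weights that were used,
   so border pixels are not darkened. `in` and `out` must not overlap. */
static void blur_renormalized(const float *in, float *out,
                              const int width, const int height,
                              const float *coeff, const int r)
{
  const int side = 2 * r + 1;
  for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
    {
      float acc = 0.0f;
      float wsum = 0.0f;
      for(int dy = -r; dy <= r; dy++)
      {
        const int yy = y + dy;
        if(yy < 0 || yy >= height) continue;
        for(int dx = -r; dx <= r; dx++)
        {
          const int xx = x + dx;
          if(xx < 0 || xx >= width) continue;
          const float w = coeff[(dy + r) * side + (dx + r)];
          acc += w * in[(size_t)yy * width + xx];
          wsum += w;
        }
      }
      const size_t k = (size_t)y * width + x;
      out[k] = (wsum > 0.0f) ? acc / wsum : in[k];
    }
}
```

A quick sanity check for any border handling of this kind: blurring a constant image should return the same constant everywhere, including the corners.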

Color equalizer gets a performance boost

When the guided filter uses a small sigma for gaussian blurring, we can use the faster dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants.

As gaussian blurring is a major performance bottleneck in color equalizer, the overall performance is almost doubled if the radii for the guided filter are small, as they are with the defaults.

Gaussian blurring maintenance

For details threshold, dual demosaicing, mask blurring and segmentation gradients we previously used special gaussian blurs.

All of these algorithms now make use of the public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions; the internal code has been removed.

jenshannoschwalm · 2024-10-01T05:28:22Z

@TurboGit the maths has not changed at all and the gaussian kernels have been there before; this is just some editing.
Please note the last comment about the fast gaussians: we now calculate the borders correctly, so differences at the borders are expected for some integration tests (dual demosaicing, possibly segmentation, details threshold).

TurboGit · 2024-10-01T15:16:15Z

Sadly this crashes with SIGSEGV on the colorequal guided-filter test:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
./run.sh: line 56: 520279 Segmentation fault      $* > /dev/null 2> /dev/null
========== COMMAND fails
To reproduce and debug:
cd /home/obry/dev/git/darktable/src/tests/integration/0159-coloreq-guided-filter; gdb --args /opt/darktable/bin/darktable-cli --width 2048 --height 2048 --hq true --apply-custom-presets false /home/obry/dev/git/darktable/src/tests/integration/images/mire1.cr2 coloreq-guided-filter.xmp output.png --core --disable-opencl --conf host_memory_limit=8192 --conf resourcelevel=reference --conf worker_threads=4 -t 4 --conf plugins/lighttable/export/force_lcms2=FALSE --conf plugins/lighttable/export/iccintent=0

The backtrace:

(gdb) bt
#0  0x00007ffff792cf34 in _fast_9x9_kernel_2._omp_fn.0 ()
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:646
#1  0x00007ffff7e0a866 in GOMP_parallel ()
    at /lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007ffff792e69e in _fast_9x9_kernel_2
    (input=0x7ffea3200040, output=0x7ffe8fa00040, width=<optimized out>, height=<optimized out>, sigma=<optimized out>, min=<optimized out>, max=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:614
#3  dt_gaussian_fast_blur
    (in=0x7ffea3200040, out=0x7ffea3200040, width=<optimized out>, height=<optimized out>, sigma=<optimized out>, min=<optimized out>, max=<optimized out>, ch=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/common/gaussian.c:812
#4  0x00007fffed90ff45 in _prefilter_chromaticity
    (UV=<optimized out>, saturation=<optimized out>, width=3908, height=2601, sigma=<optimized out>, eps=<optimized out>, sat_shift=0.100320004)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/iop/colorequal.c:672
#5  process
    (self=<optimized out>, piece=0x555557722a30, i=0x7fff24400040, o=0x7fff2e200040, roi_in=<optimized out>, roi_out=<optimized out>)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/iop/colorequal.c:1074
#6  0x00007ffff7a2bb8d in _pixelpipe_process_on_CPU
    (pipe=<optimized out>, dev=0x7fffffff7870, input=<optimized out>, input_format=0x5555574b3c30, roi_in=<optimized out>, output=0x7ffffffef118, out_format=0x7ffffffef128, roi_out=0x7ffffffedf40, module=0x5555574e5770, piece=0x555557722a30, tiling=0x7ffffffedbc0, pixelpipe_flow=0x7ffffffedb50)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:1226
#7  0x00007ffff7a2ce1f in _dev_pixelpipe_process_rec
    (pipe=pipe@entry=0x7fffffff6e70, dev=dev@entry=0x7fffffff7870, output=output@entry=0x7ffffffef118, cl_mem_output=cl_mem_output@entry=0x7ffffffef120, out_format=out_format@entry=0x7ffffffef128, roi_out=roi_out@entry=0x7ffffffedf40, modules=0x5555575188e0 = {...}, pieces=0x555557724f80 = {...}, pos=51)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:2323
#8  0x00007ffff7a2c90b in _dev_pixelpipe_process_rec
    (pipe=pipe@entry=0x7fffffff6e70, dev=dev@entry=0x7fffffff7870, output=output@entry=0x7ffffffef118, cl_mem_output=cl_mem_output@entry=0x7ffffffef120, out_format=out_format@entry=0x7ffffffef128, roi_out=roi_out@entry=0x7ffffffee2e0, modules=0x555557518900 = {...}, pieces=0x555557725c00 = {...}, pos=52)
    at /home/obry/dev/builds/c-darktable/x86_64-linux-gnu-default/src/src/develop/pixelpipe_hb.c:1360

TurboGit

Crash on regression test 0159.

jenshannoschwalm · 2024-10-01T16:09:49Z

Fixed a typo (in the relevant code w3 was 8*width but must be 6*width).

TurboGit

Something else is wrong: the math should not have changed, yet I see differences that are not on the border. See the huge diff:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
      CPU & GPU version differ by 642732 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.04421
      Avg dE                   : 0.10405
      Std dE                   : 0.20832
      ----------------------------------
      Pixels below avg + 0 std : 77.11 %
      Pixels below avg + 1 std : 81.07 %
      Pixels below avg + 3 std : 98.12 %
      Pixels below avg + 6 std : 99.98 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 6.26490
      Avg dE                   : 0.25191
      Std dE                   : 0.34982
      ----------------------------------
      Pixels below avg + 0 std : 59.02 %
      Pixels below avg + 1 std : 87.11 %
      Pixels below avg + 3 std : 98.44 %
      Pixels below avg + 6 std : 99.94 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.07 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.20198e+06 pixels changed)

The diff image for the CPU code path:

jenshannoschwalm · 2024-10-01T20:28:30Z

Found wrong kernel parameter ...

Although the standard gaussian blur code for cl buffers or CPU is pretty fast, we can do even better for small sigmas. Using a simple NxN kernel is faster because we don't have to fiddle around with sorting data. Especially on cards with lots of processing units and fast cl_mem, or on CPUs with cached memory, the algorithm is clearly faster due to cache locality.

Both fast gaussian variants (OpenCL and CPU) support 1, 2 or 4 channels and run on a 5x5, 7x7 or 9x9 coefficient matrix. The matrix size is chosen from the sigma: 9x9 for sigma > 1.2, 7x7 for sigma > 0.7, and 5x5 for smaller sigmas.

The measured performance gain depends on the chosen sigma. For both the OpenCL and the CPU code, performance relative to the "standard" gaussian is approximately:

- 9x9 200%
- 7x7 300%
- 5x5 400%

The calculated coeffs and the kernel leave out edge positions, as they contribute almost nothing to the correct result; the error vs the standard gaussian is less than 0.2% for each kernel as long as the sigma is <= 1.5.

Please note: both functions now also do a proper gaussian at the borders (they did not before this PR!).

When a small sigma is used for gaussian blurring, we can use the faster dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants. As gaussian blurring is a major performance bottleneck in color equalizer, the overall performance is almost doubled if the radii for the guided filter are small, as they are with the defaults.

For details threshold, dual demosaicing, mask blurring and segmentation gradients we previously used special gaussian blurs. All of these algorithms now make use of the public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions; the internal code has been removed.

jenshannoschwalm · 2024-10-02T04:29:58Z

I think I got the culprit. As the code has a maximum kernel size of 9x9, we should only use the fast variants with sigmas up to 1.5. (2.0 would require a 13x13...)
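The restriction described here amounts to a guard in front of the fast path. The following is only a sketch of that idea: `blur_path_t`, `FAST_GAUSSIAN_MAX_SIGMA` and `choose_blur_path` are hypothetical names, and how the callers in the PR actually make this decision is not shown in the thread.

```c
/* Hypothetical sketch of the sigma guard: the 9x9 matrix is the largest one
   available, so the fast variants are only selected for sigmas up to 1.5.
   Larger (or non-positive) sigmas fall back to the standard gaussian. */
typedef enum blur_path_t
{
  BLUR_STANDARD = 0,
  BLUR_FAST = 1
} blur_path_t;

#define FAST_GAUSSIAN_MAX_SIGMA 1.5f

static blur_path_t choose_blur_path(const float sigma)
{
  return (sigma > 0.0f && sigma <= FAST_GAUSSIAN_MAX_SIGMA) ? BLUR_FAST
                                                            : BLUR_STANDARD;
}
```

With this guard, a sigma of 2.0 would take the standard path instead of being squeezed into a 9x9 matrix that is too small for it.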

TurboGit

Still broken on my side:

Test 0159-coloreq-guided-filter
      Image mire1.cr2
      CPU & GPU version differ by 642732 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.04421
      Avg dE                   : 0.10405
      Std dE                   : 0.20832
      ----------------------------------
      Pixels below avg + 0 std : 77.11 %
      Pixels below avg + 1 std : 81.07 %
      Pixels below avg + 3 std : 98.12 %
      Pixels below avg + 6 std : 99.98 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 6.26490
      Avg dE                   : 0.25191
      Std dE                   : 0.34982
      ----------------------------------
      Pixels below avg + 0 std : 59.02 %
      Pixels below avg + 1 std : 87.11 %
      Pixels below avg + 3 std : 98.44 %
      Pixels below avg + 6 std : 99.94 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.07 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.20198e+06 pixels changed)

jenshannoschwalm · 2024-10-02T14:39:34Z

@TurboGit closing this for now until i fully understand the "issue".

jenshannoschwalm added the labels "scope: image processing" (correcting pixels), "scope: performance" (doing everything the same but faster), "scope: codebase" (making darktable source code easier to manage) and "OpenCL" (Related to darktable OpenCL code) Oct 1, 2024

jenshannoschwalm added this to the 5.0 milestone Oct 1, 2024

TurboGit self-requested a review October 1, 2024 15:16

TurboGit requested changes Oct 1, 2024


jenshannoschwalm force-pushed the faster_gaussian branch from 571e359 to 4accb30 Compare October 1, 2024 16:08

TurboGit requested changes Oct 1, 2024


jenshannoschwalm force-pushed the faster_gaussian branch from 4accb30 to f432c5b Compare October 1, 2024 20:27

jenshannoschwalm added 3 commits October 2, 2024 06:17

Gaussian blurring maintenance

2c91cae

For details threshold, dual demosaicing, mask blurring and segmentation gradients we used special gaussian blurs. All algorithms now make use of public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions, internal code has been removed.

jenshannoschwalm force-pushed the faster_gaussian branch from f432c5b to 2c91cae Compare October 2, 2024 04:25

TurboGit requested changes Oct 2, 2024


jenshannoschwalm closed this Oct 2, 2024

jenshannoschwalm deleted the faster_gaussian branch October 2, 2024 14:39

jenshannoschwalm mentioned this pull request Oct 4, 2024

Implement dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() and maintenance #17593

Merged
