Faster gaussian blur for OpenCL and CPU #17572
@TurboGit the maths has not changed at all, and the gaussian kernels were there before; this is just some editing.
Sadly this SIGSEGVs on the colorequal guided-filter test:
The backtrace:
Crash on regression test 0159.
Force-pushed from 571e359 to 4accb30.
Fixed a typo (in the relevant code, w3 was 8*width but must be 6*width).
Something else is wrong: the math should not have changed, yet I see differences that are not limited to the border. See the huge diff:
Test 0159-coloreq-guided-filter
Image mire1.cr2
CPU & GPU version differ by 642732 pixels
CPU vs. GPU report :
----------------------------------
Max dE : 2.04421
Avg dE : 0.10405
Std dE : 0.20832
----------------------------------
Pixels below avg + 0 std : 77.11 %
Pixels below avg + 1 std : 81.07 %
Pixels below avg + 3 std : 98.12 %
Pixels below avg + 6 std : 99.98 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.00 %
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 6.26490
Avg dE : 0.25191
Std dE : 0.34982
----------------------------------
Pixels below avg + 0 std : 59.02 %
Pixels below avg + 1 std : 87.11 %
Pixels below avg + 3 std : 98.44 %
Pixels below avg + 6 std : 99.94 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.07 %
FAILS: image visually changed
see diff.png for visual difference
(1.20198e+06 pixels changed)
The diff image for the CPU code path:
Force-pushed from 4accb30 to f432c5b.
Found a wrong kernel parameter ...
Although the standard gaussian blur code for cl buffers or CPU is pretty fast, we can do even better for small sigmas. Using a simple NxN kernel is faster because we don't have to fiddle around with sorting data; especially on cards with lots of processing units and fast cl_mem, or on CPUs with cached memory, the algorithm is clearly faster due to cache locality.
Both fast gaussian variants (OpenCL and CPU) support 1, 2 or 4 channels and can run on a 5x5, 7x7 or 9x9 coeff matrix; the chosen matrix size depends on the sigma (9x9 for sigma > 1.2, 7x7 for sigma > 0.7, or 5x5 if smaller than 0.7).
The measured performance gain depends on the chosen sigma. Performance vs "standard" is approximately, for both OpenCL and CPU code:
- 9x9 200%
- 7x7 300%
- 5x5 400%
The calculated coeffs and the kernel leave out edge positions as they contribute almost nothing to the correct result; the error vs the standard gaussian for each kernel is less than 0.2% as long as the sigma is <= 1.5.
Please note: both functions now also do a proper gaussian at the borders (they did not before this PR!).
When using a small sigma for gaussian blurring we can use the faster dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants. As gaussian blurring is a major performance bottleneck in color equalizer, the overall performance is almost doubled if the radii for the guided filter are small, as with the defaults.
For details threshold, dual demosaicing, mask blurring and segmentation gradients we used special gaussian blurs. All algorithms now make use of the public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions; the internal code has been removed.
Force-pushed from f432c5b to 2c91cae.
I think I found the culprit. As the code has a maximum kernel size of 9x9, we should only use the fast variants with sigmas up to 1.5. (2.0 would require a 13x13...)
Still broken on my side:
Test 0159-coloreq-guided-filter
Image mire1.cr2
CPU & GPU version differ by 642732 pixels
CPU vs. GPU report :
----------------------------------
Max dE : 2.04421
Avg dE : 0.10405
Std dE : 0.20832
----------------------------------
Pixels below avg + 0 std : 77.11 %
Pixels below avg + 1 std : 81.07 %
Pixels below avg + 3 std : 98.12 %
Pixels below avg + 6 std : 99.98 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.00 %
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 6.26490
Avg dE : 0.25191
Std dE : 0.34982
----------------------------------
Pixels below avg + 0 std : 59.02 %
Pixels below avg + 1 std : 87.11 %
Pixels below avg + 3 std : 98.44 %
Pixels below avg + 6 std : 99.94 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.07 %
FAILS: image visually changed
see diff.png for visual difference
(1.20198e+06 pixels changed)
@TurboGit closing this for now until I fully understand the "issue".
Implement dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur()
Although the standard gaussian blur code for cl buffers or CPU is pretty fast, we can do even better for small sigmas.
Using a simple small NxN kernel is faster because we don't have to fiddle around with sorting data; especially on cards with lots of processing units and fast cl_mem, or on CPUs with cached memory, the algorithm is clearly faster due to cache locality. (A note: Ingo Weyrich (heckflosse) did a lot of testing for CPUs a while ago when developing capture sharpening. He found a benefit for kernels up to 13x13.)
Both fast gaussian variants (OpenCL and CPU) support 1, 2 or 4 channels and can run on a 5x5, 7x7 or 9x9 coeff matrix; the chosen matrix size depends on the sigma (9x9 for sigma > 1.2, 7x7 for sigma > 0.7, or 5x5 if smaller than 0.7).
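The size selection described above can be sketched in C as follows. This is only an illustration, not the code from this PR: the names `fast_blur_kernel_size()` and `fast_blur_applicable()` are hypothetical, the handling of sigma exactly equal to 0.7 is an assumption (the description leaves it open), and the 1.5 upper bound is the limit for the 9x9 kernel mentioned elsewhere in this PR.

```c
#include <stdbool.h>

/* Hypothetical helper (not darktable's API): picks the width of the
   coefficient matrix from sigma, using the thresholds quoted above. */
static int fast_blur_kernel_size(const float sigma)
{
  if(sigma > 1.2f) return 9;
  if(sigma > 0.7f) return 7;
  return 5; /* assumption: sigma == 0.7 also falls into the 5x5 case */
}

/* Hypothetical guard: the fast variants top out at a 9x9 kernel, so the
   caller would fall back to the standard gaussian for sigma above 1.5.
   Rejecting non-positive sigma is an extra assumption of this sketch. */
static bool fast_blur_applicable(const float sigma)
{
  return sigma > 0.0f && sigma <= 1.5f;
}
```

A caller would check `fast_blur_applicable(sigma)` first and only then pick the matrix size; anything larger goes through the standard gaussian path.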
The measured performance gain depends on the chosen sigma. Performance vs "standard" is approximately, for both OpenCL and CPU code:
- 9x9 200%
- 7x7 300%
- 5x5 400%
The calculated coeffs and the kernel leave out edge positions as they contribute almost nothing to the correct result; the error vs the standard gaussian for each kernel is less than 0.2%.
Please note: both functions now also do a proper gaussian at the borders (they did not before this PR!).
Color equalizer gets a performance boost
When using a small sigma for gaussian blurring in the guided filter we can use the faster dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() variants.
As gaussian blurring is a major performance bottleneck in color equalizer, the overall performance is almost doubled if the radii for the guided filter are small, as with the defaults.
Gaussian blurring maintenance
For details threshold, dual demosaicing, mask blurring and segmentation gradients we used special gaussian blurs.
All algorithms now make use of the public dt_gaussian_fast_blur_cl_buffer() and dt_gaussian_fast_blur() functions; the internal code has been removed.