Using Vc::reciprocal() and SSE instead of division in OverCompositor32
Closed · Public

Authored by Pandas on Aug 14 2017, 8:58 PM.

Details

Summary

Instead of dividing directly, we can compute the reciprocal of the divisor (1 divided by it) and multiply the dividend from the original fraction by that reciprocal. Using Vc::reciprocal() together with SSE intrinsics gives a speedup of up to 30%.
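As an illustration of the arithmetic only (this is not the code from the patch), the sketch below divides three color channels by one alpha value, first with three divisions and then with one reciprocal and three multiplications. The Rgb struct and both function names are hypothetical. Note that it uses the exact 1.0f / alpha, whereas the rcpps/rcpss instructions discussed later in this review compute an approximate reciprocal, so results from the real code may differ slightly in the low bits.

```cpp
#include <cassert>

// Hypothetical pixel type, used only for this illustration.
struct Rgb {
    float r, g, b;
};

// Straightforward version: one division per channel.
inline Rgb divideChannels(Rgb c, float alpha)
{
    assert(alpha != 0.0f);
    return { c.r / alpha, c.g / alpha, c.b / alpha };
}

// Reciprocal version: a single division, then one multiplication per channel.
inline Rgb divideChannelsByReciprocal(Rgb c, float alpha)
{
    assert(alpha != 0.0f);
    const float inv = 1.0f / alpha; // exact reciprocal, not the SSE approximation
    return { c.r * inv, c.g * inv, c.b * inv };
}
```

The saving comes from the several dividends sharing one divisor: the expensive operation is done once and reused.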


Diff Detail

Repository
R37 Krita
Lint
Lint Skipped
Unit
Unit Tests Skipped
Pandas created this revision. · Aug 14 2017, 8:58 PM
Restricted Application added a subscriber: woltherav. · Aug 14 2017, 8:58 PM

Is there any real reason to pass by reference? It looks like the function would do fine with the input parameters passed by value and the result returned directly. Modern compilers will almost certainly optimize this for you, and the functions will be inlined anyway. Using a return value will also be clearer.
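To make the two styles concrete, here is a minimal sketch of the same tiny helper written both ways. The function names are hypothetical and the bodies use a plain 1.0f / value rather than the Vc or SSE code from the patch; the point is only the difference in signature and call site.

```cpp
#include <cassert>

// Out-parameter style: input by const reference, result written through a reference.
inline void reciprocalViaOutParam(const float &value, float &result)
{
    assert(value != 0.0f);
    result = 1.0f / value;
}

// Return-value style: input by value, result returned directly.
inline float reciprocalViaReturn(float value)
{
    assert(value != 0.0f);
    return 1.0f / value;
}
```

With the return-value form the caller can write `const float inv = reciprocalViaReturn(alpha);` in one line, while the out-parameter form needs a separately declared variable first.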

Also, what's the template parameter _impl for? It doesn't seem like it's being used?

dkazakov edited edge metadata. · Aug 15 2017, 7:12 AM

Is there any real reason to pass by reference? It looks like the function would do fine with the input parameters passed by value and the result returned directly.

Good point :)

@Pandas, could you check whether compiling this function with each of the two ways of returning the result generates the same assembly code?
To check that, you should generate a disassembly of KoOptimizedCompositeOpFactoryPerArch_AVX.cpp.o for two versions of the code: one with "return by reference parameter" and another with a usual return value. To generate the assembly output you can just execute:

objdump -dSC ./libs/pigment/CMakeFiles/kritapigment.dir/KoOptimizedCompositeOpFactoryPerArch_AVX.cpp.o > assembly.txt

When you have two files generated, you can just compare the pieces of code that contain vrcpps and vrcpss instructions (these are your reciprocal instructions for "packed/scalar single precision floats", there are very few of them). The pieces of code should look exactly the same.

Also, what's the template parameter _impl for? It doesn't seem like it's being used?

This parameter is used to trick the compiler/linker into not reusing/linking the same code across multiple .cpp files. We compile the same header in 5 different .cpp files with different compiler switches, and the template parameter is different for every .cpp file.
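A minimal sketch of that idea, with hypothetical tag and class names (these are not the types Krita uses): because the tag is part of the template arguments, each instantiation is a distinct type with its own symbols, so code built in one .cpp file with one set of switches is not merged with the copy built in another.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical tags; each .cpp file would instantiate the header with its own tag.
struct ScalarImplTag {};
struct AvxImplTag {};

// The Impl parameter is deliberately unused inside the body. Its only job is to
// make Divider<ScalarImplTag> and Divider<AvxImplTag> different types.
template <typename Impl>
struct Divider {
    static float divide(float dividend, float divisor)
    {
        assert(divisor != 0.0f);
        return dividend / divisor;
    }
};

// Different tags give different types, hence differently named functions.
static_assert(!std::is_same<Divider<ScalarImplTag>, Divider<AvxImplTag>>::value,
              "each tag must produce a distinct instantiation");
```

Without the tag, the linker would be free to keep a single copy of the inline function, possibly the one compiled with instructions the running CPU does not support.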

dkazakov requested changes to this revision. · Aug 15 2017, 7:14 AM
dkazakov added inline comments.
libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h
53

In case SSE is not available we should use normal division, i.e. dividend / divisor, not 1.0 / divisor. It is just faster.
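A rough sketch of the fallback being asked for, under the assumption that SSE availability is detected at compile time through the compiler-defined __SSE__ macro (the patch itself may select the path differently, for example through the Vc implementation type). The function name fastDivide is hypothetical.

```cpp
#include <cassert>

#if defined(__SSE__)
#include <xmmintrin.h> // _mm_set_ss, _mm_rcp_ss, _mm_cvtss_f32
#endif

// Divides dividend by divisor, using the SSE approximate reciprocal when
// available and plain division otherwise.
inline float fastDivide(float dividend, float divisor)
{
    assert(divisor != 0.0f);
#if defined(__SSE__)
    // rcpss gives an approximate reciprocal (roughly 12 bits of precision).
    const __m128 inv = _mm_rcp_ss(_mm_set_ss(divisor));
    return dividend * _mm_cvtss_f32(inv);
#else
    // No SSE: a single ordinary division, rather than 1.0f / divisor
    // followed by a multiplication.
    return dividend / divisor;
#endif
}
```

On the SSE path the result is only approximately equal to dividend / divisor, so callers that need an exact quotient should not use this shortcut.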

This revision now requires changes to proceed. · Aug 15 2017, 7:14 AM

@dkazakov , There are two outputs of objdump and the pieces of code with vrcpps and vrcpss are really the same. For example,

I checked all the pieces, but just in case, the full output is below:

@dkazakov , There are two outputs of objdump and the pieces of code with vrcpps and vrcpss are really the same. For example,

Then we can easily follow @alvinhochun's advice and just return the result from the function :)

Pandas updated this revision to Diff 18178. · Aug 15 2017, 11:36 AM
Pandas edited edge metadata.

Using the return value

dkazakov accepted this revision. · Aug 15 2017, 12:11 PM

Looks perfect now! :)

This revision is now accepted and ready to land. · Aug 15 2017, 12:11 PM
This revision was automatically updated to reflect the committed changes.