Instead of the usual division, we can compute the reciprocal of the divisor (1 divided by it) and multiply it by the dividend of the original fraction. Combined with Vc::reciprocal() and SSE intrinsics, this gives a speedup of up to 30%.
Diff Detail
- Repository: R37 Krita
- Lint: Lint Skipped
- Unit: Unit Tests Skipped
Are there any real reasons to pass by reference? It looks like the function would do fine with the input parameters passed by value and the result returned directly. Modern compilers will definitely optimize this for you, and the functions will be inlined anyway. Using a return value will be clearer in any case.
Also, what's the template parameter _impl for? It doesn't seem like it's being used?
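To make the pass-by-value suggestion concrete, here is a minimal sketch of the two signatures being compared. The function names are hypothetical and the bodies use plain scalar division for brevity; the real function in the diff may look different.

```cpp
// Hypothetical names for illustration only.

// Variant 1: input by const reference, result through an out-parameter.
inline void reciprocalOut(const float &value, float &result)
{
    result = 1.0f / value;
}

// Variant 2: input by value, result returned directly.
inline float reciprocal(float value)
{
    return 1.0f / value;
}
```

Both variants compute the same value; the second reads more naturally at the call site, for example `const float r = reciprocal(x);`.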
Good point :)
@Pandas, could you check whether compiling this function with the two ways of returning the result generates the same assembly code?
To check that, you should generate a disassembly of KoOptimizedCompositeOpFactoryPerArch_AVX.cpp.o for two versions of the code: one with "return by reference parameter" and another one with a usual return. To generate the assembly output you can just execute:
objdump -dSC ./libs/pigment/CMakeFiles/kritapigment.dir/KoOptimizedCompositeOpFactoryPerArch_AVX.cpp.o > assembly.txt
When you have two files generated, you can just compare the pieces of code that contain vrcpps and vrcpss instructions (these are your reciprocal instructions for "packed/scalar single precision floats", there are very few of them). The pieces of code should look exactly the same.
> Also, what's the template parameter _impl for? It doesn't seem like it's being used?
This parameter is used to trick the compiler/linker into not reusing/linking the same code across multiple .cpp files. We compile the same header in 5 different .cpp files with different compiler switches, and the template parameter is different for every .cpp file.
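A small sketch of why an otherwise unused template parameter helps here. The Arch enum and CompositeOpSketch are hypothetical stand-ins, not the real types: each distinct template argument produces a distinct type with its own mangled symbols, so the linker cannot merge a copy compiled with one set of switches with a copy compiled with another.

```cpp
#include <type_traits>

// Hypothetical tag values; the real code uses its own implementation tag.
enum class Arch { Scalar, SSE2, AVX };

// _impl is never used in the body; it only makes each instantiation unique.
template<Arch _impl>
struct CompositeOpSketch {
    static float divide(float dividend, float divisor)
    {
        return dividend / divisor;
    }
};

// Different tags give different types, hence different symbols at link time.
static_assert(!std::is_same<CompositeOpSketch<Arch::Scalar>,
                            CompositeOpSketch<Arch::AVX>>::value,
              "each tag must yield a distinct instantiation");
```

In this sketch, each .cpp file would instantiate the template with its own tag, for example `CompositeOpSketch<Arch::AVX>::divide(a, b)` in the file built with AVX switches.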
| libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h | |
|---|---|
| 53 | In case SSE is not available we should use normal division, i.e. dividend / divisor, not a multiplication by 1.0 / divisor. It is just faster. |
@dkazakov, I compared the two objdump outputs, and the pieces of code with vrcpps and vrcpss really are the same. For example,
I checked all the pieces, but the full output is below just in case:
Then we can easily follow @alvinhochun's advice and just return the result from the function :)