optimization of TextLineData::attribute
ClosedPublic

Authored by jtamate on Mar 19 2018, 2:45 PM.

Details

Summary

Use a binary search for the needed Attribute.
Also moved the implementation from the header.

From 76.84% of CPU

to 0.08% of CPU

in callgrind, doing the following:

Open an XML file with 4 lines, one of them 566,039 characters long.
Accept the prompt to reopen the file, move to the end of the file (Ctrl+End), and close.

The time difference can also be noticed in the autotests.

Test Plan

Open an XML file with 4 lines, one of them 566,039 characters long.
Accept the prompt to reopen the file, move to the end of the file (Ctrl+End), and close.

Diff Detail

Repository
R39 KTextEditor
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
jtamate created this revision.Mar 19 2018, 2:45 PM
Restricted Application added projects: Kate, Frameworks. · View Herald TranscriptMar 19 2018, 2:45 PM
jtamate requested review of this revision.Mar 19 2018, 2:45 PM

Looks reasonable.
Could one use a C++11 range-based for?
Given the container is const in that context, it won't detach, right?

jtamate updated this revision to Diff 29925.Mar 19 2018, 3:35 PM
jtamate edited the summary of this revision. (Show Details)
jtamate edited the test plan for this revision. (Show Details)

Same results, but more concise syntax with the range-based for.

jtamate updated this revision to Diff 29975.Mar 20 2018, 8:26 AM
jtamate retitled this revision from simple optimization of TextLineData::attribute to optimization of TextLineData::attribute.
jtamate edited the summary of this revision. (Show Details)
jtamate edited the test plan for this revision. (Show Details)

Since m_attributesList is sorted (otherwise the original implementation would not work),
do a binary search for the lowest item that has offset >= pos.

It gets even better: 0.32% of CPU in callgrind.

Hmm, does it really get that much faster, or is the stuff just moved into subroutines?
I would not have thought that anything is faster than a simple linear scan for a short vector.

mwolff added a subscriber: mwolff.Mar 20 2018, 2:19 PM

can you publish the test file and how you measured it? Then I can test it with perf too, to confirm the impact.

Hmm, does it really get that much faster, or is the stuff just moved into subroutines?

The sum of the subroutines' times is lower than the routine's own time, as shown in kcachegrind.

I would not have thought that anything is faster than a simple linear scan for a short vector.

I'm sorry. The third patch is totally wrong; I swapped the sides of the >= comparison in my mental model. :-(

can you publish the test file and how you measured it? Then I can test it with perf too, to confirm the impact.

I'm sorry, I can't. It contains too much private information to strip out.
But it was created using the Python program rtf2xml.py

I am still not convinced that binary search helps for vectors that normally have only << 10-20 elements for normal long text lines.
But I would be OK with some range-based for variant, as the loop with the index increment is ugly even without any speed in mind ;=)

jtamate updated this revision to Diff 30117.Mar 21 2018, 12:14 PM
jtamate edited the summary of this revision. (Show Details)

The problem with attribute() is that, with a long line and word wrapping enabled,
when the long line is shown it is called as many times as the line is long:
in my test case, 566,039 times.

The list is not always so small... In my test file, the longest list has 7,257 items.
Previously it therefore had to walk through, on average, 3,628 items, 566,039 times.

Kate and Okteta are the only tools that can handle such files and remain responsive to user input.

Ok, that is true, for the "degenerate" very-long-line case it makes sense to do the binary search.
But I don't think you should do that on your own; lower_bound is good enough - we use that for other things, like the folding, if I am not mistaken.

yes, definitely don't roll your own lower_bound - use the STL provided one. Are you really compiling in release mode while measuring this? Also, I can only repeat myself in saying that you shouldn't use callgrind for performance measurements anymore, perf/hotspot should give you a much better view of where the CPU cycles are spent.

yes, definitely don't roll your own lower_bound - use the STL provided one. Are you really compiling in release mode while measuring this? Also, I can only repeat myself in saying that you shouldn't use callgrind for performance measurements anymore, perf/hotspot should give you a much better view of where the CPU cycles are spent.

One of the things I like about using callgrind (well, kcachegrind) is that I get the tree of calls to one method, which I don't have in perf report.
As I said, I've been unable to compile hotspot yet, because I don't have the KF5 devel libraries installed from the distro, and I do not know how to specify an alternative directory to hotspot's cmake.

I'll try again with std::lower_bound, I'll take a look at folding, but I was unable to make it work last time.
There is a big difference between lower_bound and this implementation: lower_bound does a < checking, while this does a <=.

yes, definitely don't roll your own lower_bound - use the STL provided one. Are you really compiling in release mode while measuring this? Also, I can only repeat myself in saying that you shouldn't use callgrind for performance measurements anymore, perf/hotspot should give you a much better view of where the CPU cycles are spent.

One of the things I like about using callgrind (well, kcachegrind) is that I get the tree of calls to one method, which I don't have in perf report.
As I said, I've been unable to compile hotspot yet, because I don't have the KF5 devel libraries installed from the distro, and I do not know how to specify an alternative directory to hotspot's cmake.

Huh, just specify -DCMAKE_INSTALL_PREFIX to the same path where you install the KF5 devel packages to, as a simple fix. Or try to set the CMAKE_LIBRARY_PATH, CMAKE_INCLUDE_PATH and CMAKE_PREFIX_PATH env vars. Or, well, just use the AppImage?

I'll try again with std::lower_bound, I'll take a look at folding, but I was unable to make it work last time.
There is a big difference between lower_bound and this implementation: lower_bound does a < checking, while this does a <=.

Do a lower_bound and then increment from there, or maybe try to use upper_bound? I'm not acquainted with the exact semantics you are looking for, but in general: Don't reinvent the wheel, but rather try to aggregate existing building blocks.

jtamate updated this revision to Diff 30134.Mar 21 2018, 3:58 PM

Huh, just specify -DCMAKE_INSTALL_PREFIX to the same path where you install the KF5 devel packages to, as a simple fix.

Thanks, that worked.

Are you really compiling in release mode while measuring this?

Yes. cmake-options -DCMAKE_BUILD_TYPE=RelWithDebInfo

I misunderstood the role of the comparison function. Now I have it right.

Looking at TextLineData::addAttribute, it doesn't seem to sort the data - how can you be sure that the attributes list really is sorted by offset + length? I think this should be done manually there, or at least asserted. The old code apparently also relied on the order of attributes after all.

@cullmann @dhaumann: can we have overlapping attributes? If so, then this opens another can of worms when we start to use binary searches over *ranges*.

Note that we could think about an alternative (could also be combined with your patch) approach to speeding up this function: Pass around the previously found position. The way I guess this hotspot arises is that some function calls attribute on every pos in the line, linearly. We can leverage this and continue from where we left off the last time we returned.

Really, I'd like to get my hands on one of your large XML files to play around with this myself :)

Looking at TextLineData::addAttribute, it doesn't seem to sort the data - how can you be sure that the attributes list really is sorted by offset + length? I think this should be done manually there, or at least asserted. The old code apparently also relied on the order of attributes after all.

From what I've seen in my traces, the attributes are sorted.
How do you assert on the order of a list? I do not know how to do that.

@cullmann @dhaumann: can we have overlapping attributes? If so, then this opens another can of worms when we start to use binary searches over *ranges*.

Note that we could think about an alternative (could also be combined with your patch) approach to speeding up this function:
Pass around the previously found position.
The way I guess this hotspot arises is that some function calls attribute on every pos in the line, linearly. We can leverage this and continue from where we left off the last time we returned.

The function is KateSpellCheckManager::spellCheckWrtHighlightingRanges in src/spellcheck/spellcheck.cpp(176)
I tried this approach. Didn't work for me. :-(
You have to store that information in a pointer, because the method is const.

Really, I'd like to get my hands on one of your large XML files to play around with this myself :)

I'll try to generate one with no private data ;-)

Uploaded a similar file


It is created using http://interglacial.com/~sburke/pub/rtf2xml.html from an RTF with two images and some text.

I'm sorry, the previous file's lines are 10 times longer than expected.
This one {F5761669} can be managed by Kate, not only by Okteta.

Looking at TextLineData::addAttribute, it doesn't seem to sort the data - how can you be sure that the attributes list really is sorted by offset + length? I think this should be done manually there, or at least asserted. The old code apparently also relied on the order of attributes after all.

From what I've seen in my traces, the attributes are sorted.
How do you assert on the order of a list? I do not know how to do that.

You assert that newly added items actually sort after the last item already in the vector
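
That check could be sketched roughly like this - note that the `Attribute` struct and the free-function shape here are simplified stand-ins for illustration, not the real KTextEditor API:

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for TextLineData's attribute entry.
struct Attribute {
    int offset;
    int length;
    int attributeValue;
};

// Sketch: assert that each newly appended item starts at or after the end
// of the previous range, so the list stays sorted and non-overlapping --
// the precondition the binary search relies on.
void addAttribute(std::vector<Attribute> &attrs, const Attribute &a)
{
    assert(attrs.empty() ||
           a.offset >= attrs.back().offset + attrs.back().length);
    attrs.push_back(a);
}
```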

@cullmann @dhaumann: can we have overlapping attributes? If so, then this opens another can of worms when we start to use binary searches over *ranges*.

Note that we could think about an alternative (could also be combined with your patch) approach to speeding up this function:
Pass around the previously found position.
The way I guess this hotspot arises is that some function calls attribute on every pos in the line, linearly. We can leverage this and continue from where we left off the last time we returned.

The function is KateSpellCheckManager::spellCheckWrtHighlightingRanges in src/spellcheck/spellcheck.cpp(176)
I tried this approach. Didn't work for me. :-(
You have to store that information in a pointer, because the method is const.

we can return the iterator and take it as an argument again
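
A rough sketch of that idea, with an illustrative `Attribute` struct and function name that are not the real API - the caller keeps the iterator between calls, which only works when positions are queried in non-decreasing order:

```cpp
#include <vector>

// Illustrative stand-in for TextLineData's attribute entry.
struct Attribute {
    int offset;
    int length;
    int attributeValue;
};

using AttrIt = std::vector<Attribute>::const_iterator;

// Resumable lookup: the hint iterator is advanced in place and reused by the
// caller, so scanning a whole line costs O(n) in total rather than O(n) per
// position. Assumes attrs is sorted, non-overlapping, and pos never decreases.
int attributeFrom(const std::vector<Attribute> &attrs, int pos, AttrIt &hint)
{
    // skip ranges that end at or before pos
    while (hint != attrs.cend() && pos >= hint->offset + hint->length) {
        ++hint;
    }
    if (hint != attrs.cend() && hint->offset <= pos) {
        return hint->attributeValue;
    }
    return 0; // no attribute covers pos
}
```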

Really, I'd like to get my hands on one of your large XML files to play around with this myself :)

I'll try to generate one :with no private data ;-)

thanks :)

When I open your fake file in kwrite, then raise the line length limit and wait for the file to be rendered (which takes quite some time!), perf does not show any performance issues with attribute for me. It only corresponds to ~1.3% of the CPU cycles. QTextLine::layout_helper on the other hand consumes ~86.2% of the CPU cycles!

So now again I wonder: why are you seeing such drastically different results - these orders of magnitude I cannot easily explain by differences in valgrind vs. perf.

Considering spell checking is involved - can you show a screenshot of how the file looks for you? There shouldn't be a lot of spell checking going on, or so I hope...

For the sorting/overlapping question: actually, the HL code should ensure there are neither overlaps nor unsorted entries.
And I think if there are, "strange" things already happen now, as normally the first item would win.

anthonyfieroni added inline comments.
src/buffer/katetextline.cpp
214–216

Use operator->, it's faster than operator* and operator.

first->offset

Considering spell checking is involved - can you show a screenshot of how the file looks for you? There shouldn't be a lot of spell checking going on, or so I hope...

opening the file and accepting to extend the line limit.
after pressing ctrl+end

mwolff added inline comments.Mar 22 2018, 12:36 PM
src/buffer/katetextline.cpp
214–216

sorry, but that's simply not true at all.

Style-wise I agree, but an optimizing compiler will generate the same code for both.

@jtamate looking at your screenshots, they closely match what I see locally. Most notably, there are no red underlines in your screenshots that could arise from spell checking. Thus I really wonder why you are seeing such a big hotspot there.

Try perf, or try a poor man's profiler like GDB and regularly interrupt. Do you really end up in TextLineData::attribute()? Or, alternatively: measure the time it takes for kate/kwrite to open the file and then go to the end. Then compare this before and after your patch. Do you see anything in the order of a ~75% reduction in time then, too? Note how callgrind only measures instructions, so a supposed reduction of 75% of instructions should certainly have an impact on time too - of course not 75% too... I simply cannot fathom why you are seeing such an impact but I cannot reproduce it at all!

@jtamate looking at your screenshots, they closely match what I see locally. Most notably, there are no red underlines in your screenshots that could arise from spell checking. Thus I really wonder why you are seeing such a big hotspot there.

Try perf, or try a poor man's profiler like GDB and regularly interrupt. Do you really end up in TextLineData::attribute()? Or, alternatively: measure the time it takes for kate/kwrite to open the file and then go to the end. Then compare this before and after your patch. Do you see anything in the order of a ~75% reduction in time then, too? Note how callgrind only measures instructions, so a supposed reduction of 75% of instructions should certainly have an impact on time too - of course not 75% too... I simply cannot fathom why you are seeing such an impact but I cannot reproduce it at all!

I've done some measurements; as the times are so large, with a stopwatch, running each test twice.
With "Enable autodetection of Language" and "Automatic spell checking enabled by default" enabled,
the test as before: from pressing "Temporarily raise limit and reload file" until pressing Ctrl+End and finishing the scroll to the end of the document.

without any version of the patch:
1 min 25 seconds
With @mwolff's solution, used only in spellCheckWrtHighlightingRanges:
38 seconds
With the binary search:
34 seconds

@jtamate looking at your screenshots, they closely match what I see locally. Most notably, there are no red underlines in your screenshots that could arise from spell checking. Thus I really wonder why you are seeing such a big hotspot there.

Try perf, or try a poor man's profiler like GDB and regularly interrupt. Do you really end up in TextLineData::attribute()? Or, alternatively: measure the time it takes for kate/kwrite to open the file and then go to the end. Then compare this before and after your patch. Do you see anything in the order of a ~75% reduction in time then, too? Note how callgrind only measures instructions, so a supposed reduction of 75% of instructions should certainly have an impact on time too - of course not 75% too... I simply cannot fathom why you are seeing such an impact but I cannot reproduce it at all!

I've done some measurements; as the times are so large, with a stopwatch, running each test twice.
With "Enable autodetection of Language" and "Automatic spell checking enabled by default" enabled,
the test as before: from pressing "Temporarily raise limit and reload file" until pressing Ctrl+End and finishing the scroll to the end of the document.

without any version of the patch:
1 min 25 seconds
With @mwolff's solution, used only in spellCheckWrtHighlightingRanges:
38 seconds
With the binary search:
34 seconds

OK, this is a huge win, so I guess one way or another we should integrate this patch. One question on that, though: why do you sort/look up by x.offset + x.length <= p? Note how lower_bound returns the first iterator that is _not_ going to return true. To me, it looks like your code cannot actually work and will always return 0? Personally, I'd try to use upper_bound with x.offset < p in the comparison. The iterator should then point to the first item that has its offset larger than p. So decrementing the iterator once (while checking against begin()) yields the iterator that could potentially match. Thus, check if p is contained in its range and if so return its attribute, otherwise return 0.

Besides this: I am still looking for an explanation why spell checking is so extremely slow for you. I have the same settings enabled, and spell checking is seemingly fast for me... Am I missing some dictionary or something other to reproduce this?

Also, what is "@mwolff's solution"? I didn't provide any code until now:

diff --git a/src/spellcheck/spellcheck.cpp b/src/spellcheck/spellcheck.cpp
index 9e04d788..3a97e5c5 100644
--- a/src/spellcheck/spellcheck.cpp
+++ b/src/spellcheck/spellcheck.cpp
@@ -170,10 +170,32 @@ QList<QPair<KTextEditor::Range, QString> > KateSpellCheckManager::spellCheckWrtH
             if (!kateTextLine) {
                 continue;    // bug #303496
             }
+            const auto attributes = kateTextLine->attributesList();
+            auto attributes_it = attributes.begin();
+            const auto attributes_end = attributes.end();
+            auto find_attr = [&attributes_it, attributes_end](int pos) -> int {
+                // bail out early if possible
+                if (attributes_it == attributes_end || attributes_it->offset > pos) {
+                    return 0;
+                }
+                // advance until we encompass the position
+                while (pos >= (attributes_it->offset + attributes_it->length)) {
+                    ++attributes_it;
+                    if (attributes_it == attributes_end) {
+                        return 0;
+                    }
+                }
+                // now check if we have a match
+                if (attributes_it->offset <= pos) {
+                    return attributes_it->attributeValue;
+                } else {
+                    return 0;
+                }
+            };
             const int start = (line == startLine) ? startColumn : 0;
             const int end = (line == endLine) ? endColumn : kateTextLine->length();
             for (int i = start; i < end;) { // WARNING: 'i' has to be incremented manually!
-                int attr = kateTextLine->attribute(i);
+                int attr = find_attr(i);
                 const KatePrefixStore &prefixStore = highlighting->getCharacterEncodingsPrefixStore(attr);
                 QString prefixFound = prefixStore.findPrefix(kateTextLine, i);
                 if (!document->highlight()->attributeRequiresSpellchecking(static_cast<unsigned int>(attr))

Can you try that locally and see how it goes for you? Without a way to reproduce, I can't say how it performs, as it's fast before and after for me :) With some QElapsedTimer directly in this method, I see that the above is faster than the previous method, but it drowns in the overall measurements. Anyhow, this should in theory be better than any binary search, as it will only go through the list at most once per line, whereas the binary search has to find the start position repeatedly.

But as I said, both approaches could be combined if we want to. And of course if we decide to follow my suggestion, then I'd need to clean it up to put the code directly into TextLineData.

One question on that, though: why do you sort/look up by x.offset + x.length <= p? Note how lower_bound returns the first iterator that is _not_ going to return true.

Assuming there are neither overlaps nor unsorted entries:
Let's call X the iterator returned by lower_bound; suppose X is not cend(), so X.offset + X.length > p.
If another iterator Y could satisfy Y.offset + Y.length > p and Y.offset <= X.offset, it would mean there are overlaps - contradiction.
Therefore, the rest of the iterators have the following properties:
Y.offset + Y.length > p and Y.offset > X.offset
or
Y.offset + Y.length <= p and Y.offset < X.offset

To me, it looks like your code cannot actually work and will always return 0?

Nope. Otherwise the tests fail, more precisely kateindenttest.

Personally, I'd try to use upper_bound with x.offset < p in the comparison. The iterator should then point to the first item that has its offset larger than p. So decrementing the iterator once (while checking against begin()) yields the iterator that could potentially match. Thus, check if p is contained in its range and if so return its attribute, otherwise return 0.

I've tried. The code gets uglier.

Besides this: I am still looking for an explanation why spell checking is so extremely slow for you. I have the same settings enabled, and spell checking is seemingly fast for me... Am I missing some dictionary or something other to reproduce this?

Perhaps some Ignored words? I have Amarok, KAddressBook, KDevelop, KHTML, KIO, ....

Also, what is "@mwolff's solution"?

We can return the iterator and take it as an argument again. I tried the Python style locally, returning a QPair.

Can you try that locally and see how it goes for you?

I only get one second less: 33 seconds. The problem is that attribute() is called from more places. Is it worth having two implementations?

One question on that, though: why do you sort/look up by x.offset + x.length <= p? Note how lower_bound returns the first iterator that is _not_ going to return true.

Assuming there are neither overlaps nor unsorted entries:
Let's call X the iterator returned by lower_bound; suppose X is not cend(), so X.offset + X.length > p.

Ah right, the offset + length is larger, not the offset itself - that's the error in my reasoning - thanks for clearing that up!
Still, you'd get the same with an upper_bound and an x.offset + x.length < p search, no? I ask because, to me, it is somewhat odd to use a <= comparison for a lower_bound.

If another iterator Y could satisfy Y.offset + Y.length > p and Y.offset <= X.offset, it would mean there are overlaps - contradiction.
Therefore, the rest of the iterators have the following properties:
Y.offset + Y.length > p and Y.offset > X.offset
or
Y.offset + Y.length <= p and Y.offset < X.offset

To me, it looks like your code cannot actually work and will always return 0?

Nope. Otherwise the tests fail, more precisely kateindenttest.

Personally, I'd try to use upper_bound with x.offset < p in the comparison. The iterator should then point to the first item that has its offset larger than p. So decrementing the iterator once (while checking against begin()) yields the iterator that could potentially match. Thus, check if p is contained in its range and if so return its attribute, otherwise return 0.

I've tried. The code gets uglier.

See above: upper_bound with offset + length < p should return the same iterator as your search does now, but be (imo) more natural compared to what one would see elsewhere.
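
That suggestion could be sketched like so (a plain std::vector and an illustrative `Attribute` struct standing in for the real KTextEditor types):

```cpp
#include <algorithm>
#include <vector>

// Illustrative stand-in for TextLineData's attribute entry.
struct Attribute {
    int offset;
    int length;
    int attributeValue;
};

// Sketch of the lookup: assumes attrs is sorted by offset and contains no
// overlapping ranges. upper_bound finds the first entry whose range end
// (offset + length) lies beyond pos; that entry matches only if its offset
// also covers pos.
int attributeAt(const std::vector<Attribute> &attrs, int pos)
{
    auto it = std::upper_bound(attrs.cbegin(), attrs.cend(), pos,
        [](int p, const Attribute &x) { return p < x.offset + x.length; });
    if (it != attrs.cend() && it->offset <= pos) {
        return it->attributeValue;
    }
    return 0; // no attribute covers pos
}
```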

Besides this: I am still looking for an explanation why spell checking is so extremely slow for you. I have the same settings enabled, and spell checking is seemingly fast for me... Am I missing some dictionary or something other to reproduce this?

Perhaps some Ignored words? I have Amarok, KAddressBook, KDevelop, KHTML, KIO, ....

Can you maybe check how often, and for what arguments spellCheckWrtHighlightingRanges is called for you? I'll do the same, so maybe we can figure out what is going on then. Also, I'll try with a clean configuration and see if anything is different there.

Also, what is "@mwolff's solution"?

We can return the iterator and take it as an argument again. I tried the Python style locally, returning a QPair.

Can you try that locally and see how it goes for you?

I only get one second less: 33 seconds. The problem is that attribute() is called from more places. Is it worth having two implementations?

OK - thanks for measuring. I was hoping for a bigger difference. I agree that we can ignore the 3% improvement and stay with your binary search, as it's generic and will work without requiring code changes elsewhere.

jtamate updated this revision to Diff 30331.Mar 23 2018, 4:08 PM
jtamate edited the summary of this revision. (Show Details)

Using upper_bound with <
Changed the name of the iterator
Using -> instead of (*x)

These are the calls I get since I open the document:
doc= KTextEditor::DocumentPrivate(0x228fe10) range= [ (0, 0) -> (0, 0) ] dictionary= "" singleLine= true returnSingleRange= false
doc= KTextEditor::DocumentPrivate(0x228fe10) range= [ (1, 0) -> (1, 4096) ] dictionary= "" singleLine= true returnSingleRange= false
doc= KTextEditor::DocumentPrivate(0x228fe10) range= [ (0, 0) -> (0, 4090) ] dictionary= "" singleLine= true returnSingleRange= false
sonnet.core: Missing trigrams for languages: QSet("en_CA", "en_GB", "en_AU")
doc= KTextEditor::DocumentPrivate(0x228fe10) range= [ (0, 0) -> (0, 358720) ] dictionary= "" singleLine= true returnSingleRange= false
doc= KTextEditor::DocumentPrivate(0x228fe10) range= [ (5, 0) -> (5, 162682) ] dictionary= "" singleLine= true returnSingleRange= false

+1 from my side, @cullmann, @dhaumann ?

I'll continue to figure out why I don't see the performance issue here

cullmann accepted this revision.Mar 23 2018, 4:33 PM

Yep, makes sense.
Thanks for the improvement.

This revision is now accepted and ready to land.Mar 23 2018, 4:33 PM
This revision was automatically updated to reflect the committed changes.
jtamate marked 2 inline comments as done.

@jtamate I just checked, the function is called with the same parameters for me locally. What output do you get for this:

diff --git a/src/spellcheck/spellcheck.cpp b/src/spellcheck/spellcheck.cpp
index 9e04d788..50e92885 100644
--- a/src/spellcheck/spellcheck.cpp
+++ b/src/spellcheck/spellcheck.cpp
@@ -133,6 +133,7 @@ QList<QPair<KTextEditor::Range, QString> > KateSpellCheckManager::spellCheckWrtH
         bool singleLine,
         bool returnSingleRange)
 {
+    qDebug() << range << dictionary << singleLine << returnSingleRange;
     QList<QPair<KTextEditor::Range, QString> > toReturn;
     if (range.isEmpty()) {
         return toReturn;
@@ -172,6 +173,7 @@ QList<QPair<KTextEditor::Range, QString> > KateSpellCheckManager::spellCheckWrtH
             }
             const int start = (line == startLine) ? startColumn : 0;
             const int end = (line == endLine) ? endColumn : kateTextLine->length();
+            qDebug() << line << kateTextLine->attributesList().size() << (end - start);
             for (int i = start; i < end;) { // WARNING: 'i' has to be incremented manually!
                 int attr = kateTextLine->attribute(i);
                 const KatePrefixStore &prefixStore = highlighting->getCharacterEncodingsPrefixStore(attr);

For me this is the interesting bit:

15.097 debug: unknown[unknown:0]: [ (0, 0)  ->  (0, 358720) ] "" true false
15.097 debug: unknown[unknown:0]: 0 1979 358720
23.013 debug: unknown[unknown:0]: [ (5, 0)  ->  (5, 162682) ] "" true false
23.013 debug: unknown[unknown:0]: 5 12 162682

So for line 0 I should actually get quite abysmal performance too, worst case 1979 * 358720, but it flies right through here... Now I wonder what kind of CPU you are using? I have an i7-5600U CPU @ 2.60GHz here in my laptop, which has:

$ lstopo --of console
Machine (11GB)
  Package L#0 + L3 L#0 (4096KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)

The 1979 entries of attributes (each 12 bytes large) produce ~24KB of data, which fits nicely within my L1 cache. Maybe I'm just lucky and this is what hides the issue from me?

@jtamate I just checked, the function is called with the same parameters for me locally. What output do you get for this:

For me this is the interesting bit:

15.097 debug: unknown[unknown:0]: [ (0, 0)  ->  (0, 358720) ] "" true false
15.097 debug: unknown[unknown:0]: 0 1979 358720
23.013 debug: unknown[unknown:0]: [ (5, 0)  ->  (5, 162682) ] "" true false
23.013 debug: unknown[unknown:0]: 5 12 162682

So for line 0 I should actually get quite abysmal performance too, worst case 1979 * 358720, but it flies right through here... Now I wonder what kind of CPU you are using? I have an i7-5600U CPU @ 2.60GHz here in my laptop, which has:

$ lstopo --of console
Machine (11GB)
  Package L#0 + L3 L#0 (4096KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)

The 1979 entries of attributes (each 12 bytes large) produce ~24KB of data, which fits nicely within my L1 cache. Maybe I'm just lucky and this is what hides the issue from me?

Same debug output.

AMD Phenom(tm) II X6 1100T Processor, 3.30 GHz
Machine (16GB)

Package L#0 + L3 L#0 (6144KB)
  L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
  L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
  L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
  L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
  L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)

WTF :D Your desktop CPU clearly has better performance than my mobile CPU, no? Is AMD really so much worse here? How can that be - I don't get it :D

Anyhow, I give up trying to understand this now - thanks a lot for your repeated input Jaime!

WTF :D Your desktop CPU clearly has better performance than my mobile CPU, no? Is AMD really so much worse here? How can that be - I don't get it :D

Anyhow, I give up trying to understand this now - thanks a lot for your repeated input Jaime!

I've found the cause of the poor performance. In .kdesrc-buildrc I also have a line like cxxflags # -fsanitize=address -fsanitize=signed-integer-overflow -fsanitize=bounds-strict -fsanitize=undefined -fsanitize-recover=address
That line removes the -O2 from the compiler options. I added it back manually, as in cxxflags -O2 -mtune=native # -fsanitize=address .....
I "only" get 10 seconds less: 22 seconds with the patch applied.

mwolff added a comment.Apr 3 2018, 4:24 PM

Ah, I knew it :D I suspected missing compiler optimizations from the start... But indeed tricky for you to spot!

But the real question now is: what time does it take *without* this patch for you? Probably it won't be more than a minute anymore. Still, this patch is a good improvement and can/should be kept.