two new UDS structures
ClosedPublic

Authored by jtamate on May 2 2018, 11:05 AM.

Details

Summary

The two new UDS structures are similar to Frank's, but instead of using two vectors they use only one, with the index stored next to the data. The first structure uses linear search and is the fastest in the autotests.
The second structure uses binary search and scales better with the number of fields.

For the last 3 structures I've added a new test that measures read time as used in KFileItemPrivate::cmp.

If you like one of the new structures, it can replace the one currently used in KIO::UDSEntryPrivate.

Test Plan

Run the test several times.

The interesting results on my PC (using the -vb parameter):

0.00056 msecs per iteration (total: 74, iterations: 131072)
0.00032 msecs per iteration (total: 86, iterations: 262144)
0.00053 msecs per iteration (total: 70, iterations: 131072)

0.00036 msecs per iteration (total: 95, iterations: 262144)
0.00014 msecs per iteration (total: 77, iterations: 524288)
0.00048 msecs per iteration (total: 64, iterations: 131072)

0.00041 msecs per iteration (total: 54, iterations: 131072)
0.00021 msecs per iteration (total: 57, iterations: 262144)
0.00048 msecs per iteration (total: 64, iterations: 131072)

Diff Detail

Repository
R241 KIO
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
jtamate created this revision.May 2 2018, 11:05 AM
Restricted Application added a project: Frameworks. May 2 2018, 11:05 AM
jtamate requested review of this revision.May 2 2018, 11:05 AM

Thanks for that investigation. Interesting that linear search is faster than binary search in the same data structure... maybe the compiler optimizes it better? Did you profile V2 to find out where the time is spent, or do you have a better explanation?
But even if both were equal performance-wise I'd favor linear search because sorted insertion is easy to get wrong - as this patch shows ;)

When you say "scales better", we're talking about the number of fields in the udsentry, not the number of items. But kioslaves don't fill in 1000 fields, so I have the feeling that scaling with the number of fields isn't a requirement.

Are those benchmarks run in Release (or RelWithDebInfo) mode, rather than Debug (which is a big no no for benchmarks)? Qt should be compiled with optimizations enabled too.

PS: for all alternatives where the API is the same, we could use a template function (templated on the type of UDSEntry implementation) to perform the tests; this would ensure that all tests are actually testing the same thing ;-)

autotests/udsentry_benchmark.cpp
464

m_long should be initialized too (in-place initialized in the member declaration would help not forgetting that)

468

const

487

wouldn't it be faster to assign to index->m_str here?

index->m_str = value;
491

std::vector with emplace_back would be interesting here, performance-wise.

If you keep using QVector, though, then Q_DECLARE_TYPEINFO(Field, Q_MOVABLE_TYPE) would help, especially when reserving too small.

498

wouldn't it be faster to assign to index->m_long here?

index->m_long = value;
510

alternative: std::find_if.

560

QVERIFY(equal)

593

const

619

This relies on insert being called in ascending "field" order, for lower_bound to work.
But that is not necessarily the case in kioslaves, so you'd have to insert at "index" here, instead of appending.

634

const

642

const

bruns added a subscriber: bruns.May 2 2018, 3:50 PM

Thanks for that investigation. Interesting that linear search is faster than binary search in the same data structure... maybe the compiler optimizes it better? Did you profile V2 to find out where the time is spent, or do you have a better explanation?

Binary search has a (small) overhead; you can hardly beat linear search when the number of entries is small. When you use binary search, you pay when inserting: find the right position, probably move other items. When you do inserts one item at a time, you have O(n^2) behavior.

But even if both were equal performance-wise I'd favor linear search because sorted insertion is easy to get wrong - as this patch shows ;)

When you say "scales better", we're talking about the number of fields in the udsentry, not the number of items. But kioslaves don't fill in 1000 fields, so I have the feeling that scaling with the number of fields isn't a requirement.

Using one list is better than two lists (one heap allocation instead of two), one size check when inserting ...

A good data structure here is one contiguous vector storing both key and (small value or d-pointer). Inserting is O(1) when the space has been reserved, lookup is O(n). (For binary search, lookup is O(log n): for 8 entries that is 3 comparisons, versus an average of 8/2 = 4 for linear search, but linear search has no overhead.)

I would suggest QVector<QPair<uint, QVariant>>. Each element, taking alignment into account, is 24 bytes.

QMap and QHash are clearly overkill here. The involved pointer chasing ruins any performance benefit from the asymptotically faster lookup.

dfaure added a comment.May 2 2018, 4:04 PM

Thanks for that investigation. Interesting that linear search is faster than binary search in the same data structure... maybe the compiler optimizes it better? Did you profile V2 to find out where the time is spent, or do you have a better explanation?

Binary search has a (small) overhead; you can hardly beat linear search when the number of entries is small. When you use binary search, you pay when inserting: find the right position, probably move other items. When you do inserts one item at a time, you have O(n^2) behavior.

I know that, but that's not what the code was doing: he is appending in (hopefully) sorted order, so the difference was only at lookup time, where I can't see any overhead due to binary search.

When you say "scales better", we're talking about the number of fields in the udsentry, not the number of items. But kioslaves don't fill in 1000 fields, so I have the feeling that scaling with the number of fields isn't a requirement.

Using one list is better than two lists (one heap allocation instead of two), one size check when inserting ...

Sure, but I'm only comparing "Another" and "AnotherV2", which both use a single vector.

A good data structure here is one contiguous vector storing both key and (small value or d-pointer). Inserting is O(1) when the space has been reserved, lookup is O(n). (For binary search, lookup is O(log n): for 8 entries that is 3 comparisons, versus an average of 8/2 = 4 for linear search, but linear search has no overhead.)
I would suggest QVector<QPair<uint, QVariant>>. Each element, taking alignment into account, is 24 bytes.

If anyone attempts this, please name the struct and its members, don't use QPair ;-)
But no, that cannot possibly be faster. QVariant has lots of overhead itself.

jtamate marked 9 inline comments as done.May 2 2018, 4:13 PM

If anyone attempts this, please name the struct and its members, don't use QPair ;-)
But no, that cannot possibly be faster. QVariant has lots of overhead itself.

I've tried, just before reading your comment :-)
Three tests: fill the structure, compare two structures and read 3 values.

AnotherV2 (if someone finds a better name, it will be welcome).

0.00041 msecs per iteration (total: 55, iterations: 131072)
0.00022 msecs per iteration (total: 59, iterations: 262144)
0.00048 msecs per iteration (total: 64, iterations: 131072)

QPair+QVariant:

0.00056 msecs per iteration (total: 74, iterations: 131072)
0.00020 msecs per iteration (total: 55, iterations: 262144)
0.00049 msecs per iteration (total: 65, iterations: 131072)
autotests/udsentry_benchmark.cpp
619

Yes, it was badly done. Changed the fill order to detect these problems.

bruns added a comment.May 2 2018, 4:26 PM

Thanks for that investigation. Interesting that linear search is faster than binary search in the same data structure... maybe the compiler optimizes it better? Did you profile V2 to find out where the time is spent, or do you have a better explanation?

Binary search has a (small) overhead; you can hardly beat linear search when the number of entries is small. When you use binary search, you pay when inserting: find the right position, probably move other items. When you do inserts one item at a time, you have O(n^2) behavior.

I know that but that's not what the code was doing, he is appending in (hopefully) sorted order, so the difference was only at lookup time, where I can't see any overhead due to binary search.

Even when appending in sorted order, the correct position has to be checked, which is O(n), the appending itself is O(1), as no movement is involved. Appending unsorted is always O(1) (when capacity is sufficient).

When you say "scales better", we're talking about the number of fields in the udsentry, not the number of items. But kioslaves don't fill in 1000 fields, so I have the feeling that scaling with the number of fields isn't a requirement.

Using one list is better than two lists (one heap allocation instead of two), one size check when inserting ...

Sure, but I'm only comparing "Another" and "AnotherV2", which both use a single vector.

The comment was addressing the difference between this approach and the "FrankUDSEntry".

A good data structure here is one contiguous vector storing both key and (small value or d-pointer). Inserting is O(1) when the space has been reserved, lookup is O(n). (For binary search, lookup is O(log n): for 8 entries that is 3 comparisons, versus an average of 8/2 = 4 for linear search, but linear search has no overhead.)
I would suggest QVector<QPair<uint, QVariant>>. Each element, taking alignment into account, is 24 bytes.

If anyone attempts this, please name the struct and its members, don't use QPair ;-)
But no, that cannot possibly be faster. QVariant has lots of overhead itself.

QVariant has hardly any overhead in this case: the struct here stores one uint and one QString (which is just its d-ptr), while the QVariant uses one uint for the type info and one union for the data. For both the uint case and QString, the data is stored inline in the QVariant data union.

Using QVariant has the benefit of avoiding a homegrown structure. It is not faster, but it is also not slower.

bruns added inline comments.May 2 2018, 4:32 PM
autotests/udsentry_benchmark.cpp
467

The default == will not only compare the key, but also the value (that's the reason for being slow), so you have completely different behaviour. When comparing only the index, you will update the value for an existing key; otherwise you will append a new value to the same key.

jtamate updated this revision to Diff 33505.May 2 2018, 4:47 PM
jtamate marked 3 inline comments as done.
jtamate edited the summary of this revision. (Show Details)
jtamate edited the test plan for this revision. (Show Details)

Fixed the ordered insertion.
Using std::vector and std::find_if.
Initialize everything, to try to detect a change of type when "insert" is called with a different type.

Using templates to somewhat reduce the code size.
The last 3 structures are tested 3 times:

  • Fill the structure
  • Compare two structures
  • Read 3 values

When you say "scales better", we're talking about the number of fields in the udsentry, not the number of items. But kioslaves don't fill in 1000 fields, so I have the feeling that scaling with the number of fields isn't a requirement.

Yes, I was talking about the number of fields in the udsentry. I had to test it, just in case.

Are those benchmarks run in Release (or RelWithDebInfo) mode, rather than Debug (which is a big no no for benchmarks)? Qt should be compiled with optimizations enabled too.

Yes, since the last comment of D11487 everything is compiled with -O2 -mtune=native
Qt is the one provided by OpenSuse.

jtamate updated this revision to Diff 33507.May 2 2018, 6:08 PM

Another way to improve insertion speed in release mode could be:
Transform "insert" so that it really only inserts, with asserts if it is used as "replaceOrInsert", and add new "replaceOrInsert" methods.

The measurements on the first structure were done with the asserts commented out.
before: 0.00039 msecs per iteration (total: 52, iterations: 131072)
after: 0.00033 msecs per iteration (total: 88, iterations: 262144)

bruns added inline comments.May 2 2018, 9:32 PM
autotests/udsentry_benchmark.cpp
503

This comment is still wrong - you want to compare the key only, not the whole entry

Also, you no longer need it, as you use a lambda for the comparison now.

508

non-POD types should not be initialized explicitly

510

unsigned int and -1?

518

bad naming, you use field and class Field for different things

526

bad naming again, index is not an index but an entry or field

dfaure added a comment.May 3 2018, 8:06 AM

I like the replaceOrInsert idea, and the assert in insert... we might have to fix some kioslaves, but in general they have no good reason to insert twice for the same key.

In general I like the std::vector + find_if + emplace_back solution. Fast and readable.

autotests/udsentry_benchmark.cpp
520

storage.constBegin(), storage.constEnd() -- same in next method

521

constEnd() here too

522

storage.emplace_back(field, value);

533

emplace_back

558

call it "it", find_if returns an iterator.

621

constBegin() etc. to avoid detaching
it = ...

645

!index->m_str.isEmpty()

QStringLiteral() without arguments never makes sense, there's no literal...

jtamate updated this revision to Diff 33543.May 3 2018, 9:17 AM
jtamate marked 13 inline comments as done.
jtamate edited the summary of this revision. (Show Details)
jtamate edited the test plan for this revision. (Show Details)

I've included some methods that will be needed if AnotherUDSEntry replaces the current one.
I hope to have addressed all your comments.

jtamate updated this revision to Diff 33607.May 4 2018, 8:10 AM

Removed some methods that are not used in the benchmarks.
Added the asserts to check the type of the data at the beginning of every insert and replaceOrInsert.

FYI: I'm checking the new implementation live; so far only two asserts in listjob running kfind.

dfaure accepted this revision.May 4 2018, 8:33 AM

Thanks

This revision is now accepted and ready to land.May 4 2018, 8:33 AM
This revision was automatically updated to reflect the committed changes.