Implement client-side threading cache
ClosedPublic

Authored by dvratil on May 18 2016, 11:12 PM.

Details

Summary

The cache is a simple QHash<ChildId,ParentId> that we build when folder is
opened for the first time and persist it in a cache file for each folder.
Aggregation state (grouping and threading) is also stored in the cache
file so we can check whether the cache matches the current aggregation
configuration before we load it. If the aggregation has changed we simply
discard the cache file and perform full un-cached threading.

There is a second QHash<ItemId, Item*> in ThreadingCache which we fully
populate in Pass1. Pass1 may already yield some perfect matches thanks
to the Child-Parent cache, but only if the Parent Item* has already been
inserted into the second QHash - this does not happen much as we retrieve
Items in reverse order so children will arrive before their parents. If
we can't do a perfect match but we have the Parent ID available in cache
we just move the Item to Pass2 and go on to next one. Otherwise we let
Pass1 do full evaluation which will insert the Item to cache if perfect
match is found or queue it for Pass2.

In Pass2 we have the second QHash fully populated so we perfect-match
all Items using the ChildId->ParentId cache. Items that are known to not
have a parent (i.e. thread leaders) are scheduled for Pass4, Items that
are not available in cache are sent for full evaluation through Pass2
(and Pass3 if needed) and inserted into the cache.

Finally Pass4 performs grouping. There is no caching for that right now
because the grouping is dynamic and there are no real stable identifiers
for group headers. We could possibly cache all the fixed groups (i.e. sender,
receiver or subject) and maybe even .fixed-date groups (e.g. "January 2014",
"March 2015") and only let Pass4 calculate dynamic groups ("May",
"Two weeks ago", "Yesterday", ...) but the gain here would be minimal as we
are usually dealing with very few groups. The real bottleneck of Pass4
is beginInsertRows()/endInsertRows() as threads are shifted around - something
to look into next.

On my system this speeds up opening a folder with 50000 emails by ~30%.

Diff Detail

Repository
R94 PIM: Message Library
Lint
Lint Skipped
Unit
Unit Tests Skipped
dvratil updated this revision to Diff 3869.May 18 2016, 11:12 PM
dvratil retitled this revision from to Implement client-side threading cache.
dvratil updated this object.
dvratil edited the test plan for this revision. (Show Details)
dvratil added reviewers: mlaurent, vkrause.
dvratil set the repository for this revision to R94 PIM: Message Library.
Restricted Application added a project: KDE PIM. · View Herald TranscriptMay 18 2016, 11:12 PM
Restricted Application added a subscriber: kde-pim. · View Herald Transcript
dvratil updated this revision to Diff 3871.May 19 2016, 12:02 AM
dfaure edited edge metadata.May 19 2016, 6:53 AM

Well done for having the courage to touch the messagelist code :-)

I can't really review the core logic, but here's some feedback mostly on the cache file handling.

messagelist/src/core/model.cpp
762

This reads strange... What is the purpose of save() followed by load()? It sounds like reloading what we just saved, i.e. the load being a no-op. I'm guessing this isn't the case, but then it's worth a comment or renaming....

messagelist/src/core/threadingcache.cpp
41

Why is this instance a global variable? It looks like it's only used in one method.

63

If locate() can't find the file, it will return an empty string, which will lead to

  • a runtime warning from QFile about the filename being empty
  • unusable debug output only showing empty strings

I recommend splitting the locate() and the QFile ctor -- and then you can remove the if !exists check, locate already checked that for you.
Instead, check for isEmpty on the string, and possibly add the relative path (/messagelist/threading/%1) to the debug output.

74

This warning should also display the path to the file, so people can find out which dir they messed up permissions on (e.g.)

78

This isn't byte-order-independent or architecture-independent (32/64 bits), unlike QDataStream which you use for the rest. Not sure it's real problem though; in the unlikely event that someone copies his home dir from a 32 bit computer to a 64 bit computer, he should hit your "unknown version" code path, hopefully ;-). But maybe using QDataStream would be safer and more consistent?

99

s/Failed/Unusable cache/. Failed in debug output sounds like a bug that has to be investigated, while this is a normal condition here.

164

The QFile destructor closes the file, so this line is unnecessary.

dfaure added inline comments.May 19 2016, 6:54 AM
messagelist/src/core/threadingcache.cpp
137

Should this use QSaveFile to avoid partially written out files in case of full partition? On the other hand I suppose not using QSaveFile is slightly faster so if we can handle truncated files gracefully....

dvratil marked 6 inline comments as done.May 21 2016, 1:13 PM

Thanks for the review. I'd really appreciate someone checking the logic in the Model itself, but I realize that might be a bit time consuming. I did my best to understand the threading code (it looks more complicated than it really is), but of course I could've missed some edge-case scenario.

messagelist/src/core/threadingcache.cpp
41

Copy-paste mistake.

137

I added flush() after writing the CacheHeader, so if the file gets truncated, we will detect it while loading it the next time because we will reach EOF before reading CacheHeader::cacheSize records.

dvratil updated this revision to Diff 3919.May 21 2016, 1:14 PM
dvratil edited edge metadata.
dvratil marked an inline comment as done.

Addressed David's comments.

dvratil added inline comments.May 21 2016, 1:16 PM
messagelist/src/core/model.cpp
762

The in-memory cache gets dynamically updated when new emails arrive and are threaded by the model (or simply when you open a folder there's no cache for yet). When new folder is opened, Model::setStorageModel() is called and that's when we save() the in-memory cache of the old folder threading into a file, and then load() threading cache of the newly-selected folder.

Yeah... I can't find the courage to dig into the actual threading code, sorry.

Are there unit tests for this logic BTW?

messagelist/src/core/model.cpp
762

I see. Can you add a "// save cache for old folder" comment, for the next dumb reader like me? :-)

Also.... what about the first time? Shouldn't save only be done if there *is* an old folder?

763 ↗(On Diff #3919)

What if the user chose no threading, i.e. flat list?
Does the cache bring anything to the picture in that case, or just overhead?
AFAICS it's just overhead, writing out the full list of all items with a null parent into the cache. So I think the cache should be skipped if no threading is happening.

I use flat list in many folders, I'd like this opportunity while it's fresh in your head to ask if anything can be done to make that fast too :-)

messagelist/src/core/threadingcache.cpp
97 ↗(On Diff #3919)

Missing space before =

125 ↗(On Diff #3919)

No qCDebug in this case? All other return paths have one...

dvratil updated this revision to Diff 3922.May 21 2016, 11:06 PM

Addressed David's comments

dvratil marked 4 inline comments as done.May 21 2016, 11:15 PM
dvratil added inline comments.
messagelist/src/core/model.cpp
763 ↗(On Diff #3919)

Good point. I extended the logic a bit and completely disabled the cache for no-threading case.

I use flat list in many folders, I'd like this opportunity while it's fresh in your head to ask if anything can be done to make that fast too :-)

It basically boils down to the Model emitting beginInsertRows()/endInsertRows() for every single item. Disabling grouping should speed it up a little bit more. I'm looking into it now, I basically want to create some sort of insertion aggregator that would be able to insert items and emit beginInsertRows() in larger batches. However the code relies heavily on inserted items being immediately available in the internal tree (and the view as well, for auto-expanded threads), which makes this a tricky task

dfaure added inline comments.May 21 2016, 11:15 PM
messagelist/src/core/model.cpp
761

is there a newline after d here, or is phabricator on crack?

768

More generally this could say if (d->mThreadingCache.isEnabled() && !isReload)
This way the logic of when to enable/disable cache is in a single place (a bit below), and we never risk calling save() with the cache disabled. Alternatively you could add an early return in save if !mEnabled (which would also address my other comment).

messagelist/src/core/threadingcache.cpp
74

If you go to a folder with threading, then to one without threading (-> saving old folder, and then mEnabled=false), and then you destroy the ThreadingCache, this will save again the cache for the old folder, right? I think this should test for mEnabled.

dvratil updated this revision to Diff 3924.May 21 2016, 11:24 PM
dvratil marked an inline comment as done.
dvratil marked 3 inline comments as done.
dfaure accepted this revision.May 22 2016, 8:15 AM
dfaure edited edge metadata.

Looks good to me. As to the core threading logic, I strongly recommend checking for existing unittests or adding new ones.

This revision is now accepted and ready to land.May 22 2016, 8:15 AM
This revision was automatically updated to reflect the committed changes.