KDevelop: alternative monolithic storage options for TopDUContexts (PoC)
Needs ReviewPublic

Authored by rjvbb on May 7 2018, 5:17 PM.

Details

Reviewers
brauch
Summary

This was discussed on kdevelop-devel initially.

In a nutshell, I am not particularly charmed by the idea that a key/value database of potentially thousands of pairs is stored using individual files in a single directory, in addition using as many mmaps. A comment in the code itself claims that this may be problematic on some platforms. As I said in that ML thread, I am usually all for simple databases that use the filesystem, but get an uncomfortable feeling when that involves this many files (around 6000 for KDevelop's source tree itself). In an application that is supposed to handle lots of files as its main purpose I'd prefer a solution that uses only a single file, without any hard proof that this is better.

On my question if there had been a recent assessment of how suitable modern simple key/value databases are as a backend for this feature, Sven said

I think Milian investigated this once and concluded that no database on the market can do the lookups we need to do in an efficient-enough
manner. Also, this is a huge amount of work to change.

That said, if there is a solution with the same or superior performance and somebody is willing to do the porting, I'm all for it ...

I had been looking for an excuse to get some experience with database libraries like LMDB so started to work on an approach that would hopefully not be a "huge amount of work". The result is attached, with working backend implementations for the original QFile-based approach, LMDB (with the lmdbxx C++ bindings and additional LZ4 value compression), LevelDB and (LZO-compressed) KyotoCabinet (the latter 2 deactivated).

The backend implementation is based on the QFile API, with a very thin wrapper around QFile or more complex classes for the 3 database backends; the desired backend is currently selected via a preprocessor token in topducontextdynamicdata_p.h (requiring a rebuild of only 2 files from KDevPlatformLanguage).

I've been doing extensive testing in real-life and using the test_topcontextstore unittest/benchmark, and think the patch is now at a stage where it can be presented to the world as (at least) a proof of concept.

In practice, the LMDB backend corresponds best to what I was hoping to achieve: minimal overhead compared to the existing implementation (at least on Linux) with potentially an increase in disk efficiency (in addition to using a single file) and possibly a performance gain on slow media. In short, comparing to the file-based backend: writing TopContexts is slower but not noticeably so in real life; reading is comparably fast (even in the benchmark) on local storage. On slow (remote) storage, writing can be comparably fast while reading is as fast as on local storage. (I'm not really certain how to explain that, btw.)
More complete benchmark results (on Mac and Linux) are included in test_topcontextstore.cpp. The LZ4 compression gives up to a 2x decrease in average of the stored value sizes with a negligible impact on performance (probably because the benchmark first fills the database file with uncompressable random values).

Re: LMDB and NFS: there is a known issue with LMDB on NFS shares. As far as I understand, this concerns different use cases (shared databases) than the one at hand (a private databased used by a single application at a time). The unittest has not shown any problems with XDG_CACHE_HOME points to a directory on an NFS share.

I have not tested LevelDB extensively as it can use multiple (many) files too. In addition, LevelDB builds with TCMalloc by default which can (and has) lead to issues (deadlocks).

I looked into KyotoCabinet because the LevelDB performance notes mention it can be faster for random read access. In my implementation it does not at all live up to that reputation.

For now I have been keeping those backend implementations for reference (cum potential "junk DNA").

Test Plan

Tested on Mac and Linux in real-life usage as well as with the unittest/benchmark. Does exactly what I was looking for.

I have been using the lmdbxx C++ binding; at some point I might simply include a stripped-down/tailored version of it instead.

Diff Detail

Repository
R32 KDevelop
Lint
Lint Skipped
Unit
Unit Tests Skipped
rjvbb requested review of this revision.May 7 2018, 5:17 PM
rjvbb created this revision.
mwolff added a subscriber: mwolff.May 7 2018, 7:14 PM

so what's your plan with this now? Do you actually want to move this forward? Then we need to find a good solution that works everywhere. The numbers aren't that useful on their own, as there the current implementation which is trivial and has no external dependencies is apparently still the best by a margin...

I'm really quite unsure if we want to go down this route. If we'd decide to do that, then it would need to be done also in the other places, most notably to replace our actual performance sensitive databases. The current topconteext data loading is pretty fast. What's slow, or at least used to be slow, is hammering the big databases we have (i.e. AbstractItemRepository). *These* are the slow ones, since they are actually used a lot. Still, the current implementation was much faster than lmbd the last time I tried. The reason is that we can access and insert quickly, and lookup is O(1) since it's an on-disk hash map. But actually, I would very much appreciate getting *rid* of that complicated code and replacing it by something else. The reason is that we still have lots of crash reports in that code that are quasi impossible to debug/reproduce. Also, if we'd use an established data base implementation, we could potentially make access cross-process safe, which I would really like to see in the long term.

So before I do any review, we first need to talk about the above.

kdevplatform/language/duchain/tests/test_topcontextstore.cpp
77

you are missing the 6k results here

146

that's pretty bad in comparison

note that we currently just rewrite the complete file, which you can't do when you use one big data base

356

O_o make it a separate data file and read it from a resource instead of embedding it manually

rjvbb added a comment.May 7 2018, 9:21 PM
so what's your plan with this now?

I plan to continue to use this for topcontexts myself, maybe extend it to the other DUChain stores if/when I get around to understanding that code. The topcontexts storage was easy enough to tackle, and as mentioned, my main goal here was to get rid of the file-based implementation (and numerous mmaps) without taking a performance hit. I wasn't aiming to or expecting to improve topcontext performance, that's a misunderstanding I already tried to set right on the ML.

One reason to publish the patch was clearly to make it available as a start for others to toy with (for lack of a better word).

The numbers aren't that useful on their own, as there the current implementation which is trivial and has no external dependencies is apparently still the best by a margin...

For writing to and deleting from a local cache directory, indeed, but those are operations that occur only infrequently and rarely with the full range of indices. You've mentioned yourself that topcontext I/O is not what bounds performance, so the difference in read speed should be negligible when loading or parsing a project in KDevelop (and my experience confirms that; even the KyotoCabinet backend doesn't really feel slower).
I'd expect that the better performance of LMDB on NFS shares might well be drowned out too if everything else is loaded over NFS too.

That said, the benchmarks time serial access. I've deliberately run them on a system that has my usual other tasks running, but not inside a KDevelop session that's doing all kinds of other disk access. Depending on your disk and filesystem that could make a significant difference on performance of the current implementation, but I doubt it will be easy to quantify that.

TBH, it took me a while before I got the database benchmarks to give consistent results esp. for writing, and not timings that fluctuated between much faster and much slower than the current values. See my remark above about others "playing" with this.

hammering the big databases we have (i.e. AbstractItemRepository). *These* are the slow ones, since they are actually used a lot.

Is that the database that's scattered in the other files, 1 directory above the topcontexts directory?

Still, the current implementation was much faster than lmbd the last time I tried. The reason is that we can access and insert quickly, and
lookup is O(1) since it's an on-disk hash map.

The KyotoCabinet backend I tested also uses an on-disk hash map with O(1) lookup - according to the documents. I found its API the cleanest of the 3 I tried. Maybe it'll prove faster in other applications.

But actually, I would very much appreciate getting *rid* of that complicated code and replacing it by something else.

That certainly wouldn't hurt. I've managed to reduce the number of crashes and deadlocks with a patch that treats symptoms but a complete overhaul seems a good idea (as well as a lot of tedious and not very gratifying work). But would a simple key/value database be sufficient (rather than e.g. sqlite3)?

access cross-process safe, which I would really like to see in the long term.

A central cache so different sessions that open the same file don't have to generate and maintain their own data? I think Xcode caches (as in: hides) something like that somewhere, which can become really huge (probably because it isn't well defined when you can/should remove cached data).

kdevplatform/language/duchain/tests/test_topcontextstore.cpp
77

I can add them of course, but from what I've seen they scale in a nicely linear fashion. My initial tests with the file-based backend showed non-linear scaling there which has now all but disappeared so I could just as well remove the 6k results, probably.

146

That's removing all entries with a flush after each remove (which should be redundant).

There *is* original code that calls QFile::remove(), hopefully only when the index is not going to be reused immediately. Rewriting an entire file becomes overwriting an existing key/value pair, which all databases tested support in much the same way it is done with files.

356

I thought about using a qresource but the extra work seemed a bit pointless as it wouldn't reduce the size of the patch. I'd like to avoid binary files for now (until this actually begins to stand a chance to be committed).

rjvbb added a comment.EditedMay 7 2018, 9:36 PM

EDIT: while C++ exceptions are apparently with almost zero cost as long as they don't occur (https://stackoverflow.com/questions/13835817/are-exceptions-in-c-really-slow) there must also be a reason why they're disabled by default for KF5. Should I expect a benefit of rewriting lmdbxx so topducontextdynamicdata_p.cpp doesn't require exception support?

rjvbb updated this revision to Diff 33823.May 8 2018, 2:25 PM

This revision brings some code cleanup (including the use of a qresource in the unittest/benchmark) and above all, a signficant performance improvement.
It turns out that the default LMDB environment uses synchronous I/O. In our case we can take the risk to use async I/O because we would at most lose the last few changes if a crash occurs - and most of the time the entire cache will be deleted after a crash anyway.

This change makes the LMDB backend almost as fast as the file-based backend on Mac.

rjvbb set the repository for this revision to R32 KDevelop.May 8 2018, 2:27 PM
rjvbb marked 3 inline comments as done.
rjvbb edited the summary of this revision. (Show Details)May 9 2018, 9:36 AM
rjvbb added a comment.May 18 2018, 8:06 AM

How about I clean this up a little (removing at least the KyotoCabinet backend) and then push it to a branch so it's easier to pick up for anyone else who like to work on this or even just check it out?

I'm still using the 5.2 branch myself but I'm hoping that this part of the code is "well preserved" enough that the patch applies without too much hassle.

rjvbb updated this revision to Diff 38266.Jul 23 2018, 4:04 PM

Refactored for the current master branch

rjvbb added a comment.Jul 23 2018, 4:07 PM

Patch for the 5.2 branch (fully tested):

rjvbb retitled this revision from KDevelop: alternative monolithic storage options for TopDUContexts (WIP/PoC) to KDevelop: alternative monolithic storage options for TopDUContexts (PoC).Nov 13 2018, 10:39 AM
rjvbb edited the test plan for this revision. (Show Details)
brauch requested changes to this revision.Nov 13 2018, 10:50 AM

Sorry, I don't see this going in in this or a similar form.

  • What Milian said: the file storage is not the core of the problem neither performance- nor complexity wise.
  • There is no way we are going to merge a patch which implements 4 different storage options for the user to select from. That is a support and maintenance nightmare.
This revision now requires changes to proceed.Nov 13 2018, 10:50 AM

Ah, yes, I'll have to remember to clean it up and keep only the LMDB backend.

File storage *was* the core of *my* problem and the raison d'être for this patch (IIRC I posted some quick results showing that it does improve performance on NFS home directories).

rjvbb updated this revision to Diff 45462.Nov 14 2018, 3:56 PM

Patch cleaned up, stripped the LevelDB and Kyoto backends that never satisfied me.
I did leave the original file-based storage backend, not because I think it has to be preserved if this ever gets in but to provide a quicker way to compare performance (and behaviour if ever someone testing this runs into issues).

rjvbb set the repository for this revision to R32 KDevelop.Nov 14 2018, 4:03 PM
rjvbb marked 2 inline comments as done.Nov 14 2018, 4:16 PM
rjvbb updated this revision to Diff 68453.Oct 21 2019, 4:11 PM

Rebased for the 5.4 branch. Still working perfectly for me, without noticeably slower reaction times on local filesystems.

As to overhauling the entire duchain cache on-disk organisation and storage model: would Cap'n Proto (https://capnproto.org/) provide an alternative? I understand that stuff (for lack of a better word) is "insanely fast" but don't really see where else it could be used besides replacing JSON or "Protocol Buffers". It supports backwards & binary compatible updates to the storage model though, which should be useful.