Optimize inotify KDirWatch backend: map inotify wd to Entry
ClosedPublic

Authored by mwolff on Jan 11 2018, 3:11 PM.

Details

Summary

This greatly reduces the on-CPU time of the benchNotifyWatcher.
The issue was that for every inotify event, the list of all
entries would be walked, which has abysmal performance when
the map is large. By introducing a direct mapping we greatly
speed things up.

I actually spotted this issue while profiling KDevelop, which
sometimes exhibits similar performance issues.

Running the benchmarks with -perf we can measure the CPU cycles,
which translate to on-CPU time, ignoring the off-CPU time
induced by sleeping and waiting in the tests.

Before:
RESULT : KDirWatch_UnitTest::benchNotifyWatcher():

306,496,490.1 CPU cycles per iteration (total: 3,064,964,902, iterations: 10)

After:
RESULT : KDirWatch_UnitTest::benchNotifyWatcher():

219,120,818.3 CPU cycles per iteration (total: 2,191,208,183, iterations: 10)

Note that the other backends could leverage a similar trick to
speed them up.

Diff Detail

Repository
R244 KCoreAddons
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
mwolff created this revision.Jan 11 2018, 3:11 PM
Restricted Application added a project: Frameworks.Jan 11 2018, 3:11 PM
Restricted Application added a subscriber: Frameworks.
mwolff requested review of this revision.Jan 11 2018, 3:11 PM

@rjvbb this patch just shows that KDirWatch has tons of performance issues that need to be fixed here. Please do run the new benchmarks with your backend, profile them, optimize them.

rjvbb requested changes to this revision.EditedJan 11 2018, 8:29 PM

@rjvbb this patch just shows that KDirWatch has tons of performance issues that need to be fixed here.

That may be an open door, but the plural of anecdote is not data. You found one example; that doesn't mean that my gripe with the class can be solved easily and purely in there. The issue I have been seeing concerned setting up (feeding) a KDirWatch instance with a potentially huge number of items to watch, and is basically an example of inappropriate use of the class.

Reproducing my benchmark with this patch thus shouldn't change the result because the benchmark protocol involves directories that are not seeing any changes.

This revision now requires changes to proceed.Jan 11 2018, 8:29 PM
rjvbb added a comment.Jan 11 2018, 8:42 PM

Please test this for the QFSW backend too and post a benchmark result comparison (and include the change if beneficial).

The QFSW backend is used on other Unices, on Mac, and on Linux when the inotify backend cannot be used (on NFS, for instance).

src/lib/io/kdirwatch_p.h
224

How much overhead does this give per watched item?

I mean in bytes but I'm guessing there might be a break-even point, a number of entries under which the look-up in the map is more expensive than the current implementation. That could affect applications using KDW to monitor a tiny number of files. If correct, what kind of overhead are we talking about?

dfaure added inline comments.Jan 12 2018, 8:48 AM
src/lib/io/kdirwatch.cpp
721

.insert(e->wd, e) is a hair faster, says Effective STL Item 24 :)

src/lib/io/kdirwatch_p.h
224

Each hash node will be approx 4+8=12 bytes, and something must point to it, so another 8 bytes, 20 in total. No big deal.

rjvbb added a comment.Jan 12 2018, 9:58 AM

Each hash node will be approx 4+8=12 bytes, and something must point to it, so another 8 bytes, 20 in total. No big deal.

I didn't think it'd be a big deal, not under normal circumstances. Still, with applications like KDevelop that currently add every single item in their project trees, we're potentially dealing with thousands of files (around 7500 for GCC IIRC, much more for QtWebEngine). Then every byte starts to count, especially if file-change monitoring isn't the only feature that requires gobs of RAM.

150KB is not a lot of memory
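To put numbers on that exchange (my arithmetic, combining dfaure's ~20-bytes-per-node estimate with the ~7500 files cited for GCC; both figures come from this thread, the code below just does the multiplication):

```cpp
#include <cassert>

// Per-node overhead of the wd -> Entry cache, per dfaure's estimate:
// 4 bytes (int key) + 8 bytes (Entry* value) + ~8 bytes of node
// bookkeeping = ~20 bytes per watched item.
constexpr long bytesPerNode = 4 + 8 + 8;

// ~7500 watched files for a GCC project tree (rjvbb's figure).
constexpr long gccEntries = 7500;

// Total: 150,000 bytes, i.e. roughly 150 KB for a large project.
constexpr long totalBytes = bytesPerNode * gccEntries;
```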

markg added a subscriber: markg.Jan 15 2018, 10:13 AM
markg added inline comments.
src/lib/io/kdirwatch.cpp
721

m_inotify_wd_to_entry is a QHash, not a std::unordered_map ;)
Not sure if the same rules from that book apply to Qt (I doubt it).

I'm 100% sure the same rules apply, because the "limitations" of operator[] come from C++ and therefore apply to all associative containers.

mwolff updated this revision to Diff 25441.Jan 16 2018, 8:38 AM

use QHash::insert instead of operator[]

@rjvbb why did you request changes to proceed with this patch? The fact that there are more issues in KDirWatch does not mean we should hold up this approach.

Also, please note how the fundamental issue described here affects more backends besides inotify, but they will need to be handled separately. There are a lot of O(N) iterations over the full map of watched contents, which is inherently a bad idea. This needs to be fixed. I added you to this review so that you can see how one could do it.

@rjvbb why did you request changes to proceed with this patch? The fact that there are more issues in KDirWatch does not mean we should hold up this approach.

That should be obvious, no? You identified an issue but only addressed it in a single backend. As far as I can tell the same approach should work in the QFSW backend, so I consider your current patch unfinished. At the very least you should present benchmark results that show the effect on the QFileSystemWatcher backend, even if they illustrate that the effect is in fact negligible.

Note how I'm not asking for things that require installing additional dependencies, writing new unit tests, or even spending hours on it. The change looks trivial enough to refactor and apply to the QFSW backend.

Also, please note how the fundamental issue described here affects more backends beside inotify, but they will need to be handled separately.

I do NOT agree, not without proof that your simple fix does not work for at least the QFSW backend (which is probably the most commonly used across platforms).

Your fix may not be the optimal one for all backends but shunning an improvement just because there might be a better one isn't very productive. Just leave a comment in the code or remark in the commit message if you have reason to believe that things can be improved even more.

I added you to this review so that you can see how one could do it.

And now just imagine reversing the roles. I would get exactly the same orders, probably requiring me to address the issue for all backends even if it required using different approaches.
Not that I'm looking for payback, but I'm not going to drop my request just so no one might think I am in fact looking for that.

This code has hardly evolved for years. Now someone with the required theoretical knowledge is looking into improving things. I vote that he tries to let the benefits of his improvement be reaped as widely as possible, while he is at it.

Yeah, I agree. Let's fix it. But I won't write mega patches like you seem to prefer. I try to keep things as minimal as possible. And I also can't give you an ETA or guarantee on when I'll fix the rest. This patch solves one big issue, and the added benchmark lays the foundation for future work. Let's start from here and get going.

But I won't write mega patches like you seem to prefer.

I don't. They happen when it's more cumbersome than self-evident to go back and break up a new feature in baby-step components and justify adding them one at a time.

I try to keep things as minimal as possible. And I also can't give you an ETA or guarantee on when I'll fix the rest. This patch solves one big issue, and the added benchmark lays the foundation for future work. Let's start from here and get going.

Agreed. It's a start, but if you can do the main 2 backends with (more or less) the same fix that'd be great. Apparently you rarely encounter the QFSW backend in your world anyway, so the risk should be small that applying a possibly suboptimal fix for that backend will satisfy you enough to let things rest there.
On the contrary, someone who does use that backend all the time might realise that improvements were possible and step in.

So please do measure the impact of your fix on performance with the QFSW backend.

In D9824#191748, @rjvbb wrote:

So please do measure the impact of your fix on performance with the QFSW backend.

It's zero, since this patch only touches the inotify backend.

rjvbb added a comment.Jan 16 2018, 3:11 PM

I am going to assume you would have said so by now if the fix is not applicable to the QFSW backend because it doesn't do the same costly walk.

> So please MAKE your PATCH TOUCH the QFSW backend AND MEASURE THE IMPACT ON ITS PERFORMANCE

I don't care a rat's ass if that is done in a separate patch or in this one, it just has to be done.

markg added a comment.EditedJan 16 2018, 3:45 PM
In D9824#191929, @rjvbb wrote:

I am going to assume you would have said so by now if the fix is not applicable to the QFSW backend because it doesn't do the same costly walk.

> So please MAKE your PATCH TOUCH the QFSW backend AND MEASURE THE IMPACT ON ITS PERFORMANCE

I don't care a rat's ass if that is done in a separate patch or in this one, it just has to be done.

Why do you demand that of him? Work together and calm down.

Regarding QFSW, I personally doubt there would be any performance improvement for it. QFSW does not emit the details that inotify has. You don't know if a file got deleted, only that "something" changed. The extra overhead of checking what might have changed has some cost, and I'm assuming the cost would be bigger than the benefit. (just an assumption here, I could be wrong)

rjvbb added a comment.Jan 16 2018, 7:57 PM

I'm assuming the cost would be bigger than the benefit. (just an assumption here, I could be wrong)

That's why it has to be measured.

I've never seen processing (reaction) times that measure in minutes with either backend (which in turn assumes that you actually notice such events when you're not looking for them). If that time is indeed spent walking a list that is also walked in other backends, it seems unlikely that the cost could be bigger than the benefit, given the observed gain. The gain could indeed be smaller, but could in fact be even larger if more work needs to be done for each item in the list.

@rjvbb Please mind your language. From my perspective what you are asking of @mwolff here is quite unreasonable - I can't see any reason why incremental improvements, piece by piece would be unacceptable here.

I don't see anything unreasonable about my request, which can also be satisfied by an argued demonstration of why the same approach/fix doesn't apply to other backends. AFAIAC I'm just doing my reviewer's service like it's been shown to me how that's done.

If the same approach can improve the 2 main backends (one of them being the cross-platform "default" that's not specific to a minority OS) that opportunity should be seized.

Let me remind everyone that it's often been pointed out to me that I had been wasting more time arguing than it would have taken to do as asked (in addition to remarks about partial fixes and workarounds). Those don't apply just to me.

I'm even going to add a request. Please describe a real-world example where this change actually has a noticeable impact, and please use time rather than CPU cycles as the measure. The observation that KDevelop sometimes exhibits similar performance issues is a little vague. I'm one of the few KDevelop users who have voiced concerns about KDirWatch performance but those are related to the adding of new watch items.

If real-world impact is in the order of a highly significant but millisecond-order reduction of reaction time to file change there may be little reason to commit this improvement now (and then risk forgetting about the rest again). My hunch is that it could be more effective then to keep this change pending and use it as a motivation to work on a more complete overhaul of the class.

A final suggestion: wouldn't it be better to reorganise the code first, splitting off the different backends into different files rather than keeping them all in a single file with all the resulting #ifdefs?

In D9824#192793, @rjvbb wrote:

If real-world impact is in the order of a highly significant but millisecond-order reduction of reaction time to file change there may be little reason to commit this improvement now (and then risk forgetting about the rest again). My hunch is that it could be more effective then to keep this change pending and use it as a motivation to work on a more complete overhaul of the class.

You know how they say, “perfect is the enemy of good.” Also we know at least one person that isn't going to forget about this, don't we? To keep an improvement — and you don't seem to dispute that it is an improvement — pending as “motivation to work on a more complete overhaul” is an argument I can't follow. I guess this is what Uncle Bob calls the Grand Redesign in the Sky.

I'm 100% sure the same rules applies, because the "limitations" of operator[] come from C++ and therefore apply to all associative containers.

The limitation is that with operator[] followed by an assignment the associated value is first value-initialized (here zero-initialized), before later the actual value is written. Since the value has type Entry*, "a hair faster" is quite accurate. I doubt this is even measurable, taking into account the cost of allocation, computing the hash value, and possible cache misses that are associated with adding a new element.

I'm not saying we shouldn't use insert, but it's probably a micro-optimization.
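A minimal standalone illustration of that limitation (shown with std::unordered_map; the point carries over to QHash since it stems from the operator[] contract, and the Entry struct is a stand-in): subscript-then-assign value-initializes the mapped pointer to nullptr before the assignment overwrites it, while insert constructs the pair in one step.

```cpp
#include <cassert>
#include <unordered_map>

struct Entry { int wd = -1; };  // stand-in for KDirWatchPrivate::Entry

// operator[]: if the key is absent, the mapped Entry* is first
// value-initialized to nullptr, and only then overwritten by e.
void addViaSubscript(std::unordered_map<int, Entry*>& map, int wd, Entry* e) {
    map[wd] = e;
}

// insert: the key/value pair is constructed directly, skipping the
// value-initialization; for a plain pointer this saves only "a hair",
// dwarfed by hashing and node allocation.
void addViaInsert(std::unordered_map<int, Entry*>& map, int wd, Entry* e) {
    map.insert({wd, e});
}
```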

Also we know at least one person that isn't going to forget about this, don't we?

If you thought it'd be me *who* would keep nagging about this, you can forget about that.

and you don't seem to dispute that it is an improvement

Because I simply don't know. It's apparently an improvement for a synthetic benchmark but I've seen too many examples where similar off-the-charts differences in a simple benchmark were completely unnoticeable IRL. The gains might be crucial in realtime applications (but who would use KDW there...), in other cases it might be as underwhelming as the memory overhead that comes with it.

markg accepted this revision.Jan 18 2018, 11:06 PM

Ship it from my side.

dfaure accepted this revision.Jan 18 2018, 11:06 PM

This patch fixes an O(n) performance issue in the inotify backend using a cache, which makes KDirWatch *better* suited for applications that use KDirWatch heavily. Clearly a step in the right direction. Yes a cache needs memory, like all caches, how else is one going to optimize linear searches [in unsorted data]...

I looked at whether other backends have a similar linear search, and only KDirWatchPrivate::checkFAMEvent has something similar, so FAM could use a cache where the key would be the "request number" as obtained with FAMREQUEST_GETREQNUM. That's a different patch though. The other backends don't have such a linear search, why are we even talking of "doing the same for the QFSW backend"? That's not applicable. How about taking a look at the code before requesting to optimize linear searches that don't exist?

The benchmark is not synthetic, it measures exactly the use case of monitoring a large number of files and touching them all (like git checkout anotherbranch would do).

+2 from me.

rjvbb added a comment.Jan 19 2018, 1:19 AM

That's not applicable.

Did you read my actual, original request? If so you'd have seen that it's satisfied by this answer. Also please note the final remark in the author's summary: "Note that the other backends could leverage a similar trick to speed them up."

The benchmark is not synthetic, it measures exactly the use case of monitoring a large number of files and touching them all (like `git checkout anotherbranch` would do).

What's the use of an application that watches the amount of files we're talking about here for changes with which it does nothing, other than benchmarking or unit-testing? We don't know how many files we're talking about here (the benchmark results only show the total amount of CPU cycles spent). I suppose it must be a really large number if walking the list takes 7 minutes. I'm pretty certain that simultaneous change of only a fraction of that number of files will cause a very significant CPU load when KDevelop starts clang-parsing them all.

In D9824#193179, @rjvbb wrote:

That's not applicable.

Did you read my actual, original request? If so you'd have seen that it's satisfied by this answer.

Then why didn't you open kdirwatch.cpp yourself to look for m_mapEntries.constBegin() like I did? It's not rocket science, and it would have moved the discussion forward.

Also please note the final remark in the author's summary, Note that the other backends could leverage a similar trick to speed them up.

Correct for FAM.
Doesn't turn out to be the case for the other backends.

The benchmark is not synthetic, it measures exactly the use case of monitoring a large number of files and touching them all (like `git checkout anotherbranch` would do).

What's the use of an application that watches the amount of files we're talking about here for changes with which it does nothing, other than benchmarking or unit-testing? We don't know how many files we're talking about here (the benchmark results only show the total amount of CPU cycles spent). I suppose it must be a really large number if walking the list takes 7 minutes. I'm pretty certain that simultaneous change of only a fraction of that number of files will cause a very significant CPU load when KDevelop starts clang-parsing them all.

clang-parsing is outside of KDirWatch's responsibilities. The point of all this is to make KDirWatch itself faster, isn't it? I thought you found it slow yourself, surely you can't object to Milian working on making it faster, whatever else the applications do after a change notification...

David Faure wrote on 20180119::08:33:32 re: "D9824: Optimize inotify KDirWatch backend: map inotify wd to Entry"

Then why didn't you open XXX yourself to look for XXX like I did? It's not rocket science, and it would have moved the discussion forward.

From previous experience the local culture is that a patch author explains why s/he does things and not certain other things (when asked). I don't always appreciate that when I'm the author but recognise that it's not an unreasonable principle.

Correct for FAM.
Doesn't turn out to be the case for the other backends.

It should not have taken the OA long to establish this (me undoubtedly much longer).
As he's likely to say himself: "Don't assume, measure".

clang-parsing is outside of KDirWatch's responsibilities. The point of all this is to make KDirWatch itself faster, isn't it?

Yeah, and there's a hole in my bucket ... If this makes KDW faster in ways that you can hardly measure under normal use, there was little point even to ask for a review.

Seriously, if something as simple as walking a list takes 7 minutes, how long would doing something (undoubtedly more time consuming) with all those files take? That's not a reason NOT to do (micro) optimisations, of course (but maybe a reason not to treat it as something spectacular).

I thought you found it slow yourself, surely you can't object to Milian working on making it faster, whatever else the applications do after a change notification...

Of course, but I wonder why I'm still getting tricked into replying if my actual points aren't being read. What is slow in my experience is feeding a KDW instance. In every other experience I have with runtime behaviour and reaction to file changes the bottleneck is not KDW but whatever happens downstream.

That's another reason why I haven't jumped in; I don't use any application where reaction time to filesystem changes is critical (or even too long to my taste). And if I did I'd be looking at cutting out layers between my code and the notification source.

What is slow in my experience is feeding a KDW instance.

That is part of the benchmark, see KDirWatch_UnitTest::benchCreateWatcher(). So the door is open for further improvements in that area.

aaronpuchert accepted this revision.Jan 19 2018, 8:58 PM

@rjvbb You certainly have a point when you say that there are more deficits to address, but nothing will ever get done unless someone does it. Nobody here is claiming that this solves all problems, but I don't see how it would hinder further improvements down the road, including those that you have sketched.

rjvbb resigned from this revision.Jan 19 2018, 9:25 PM
This revision is now accepted and ready to land.Jan 19 2018, 9:25 PM

This patch fixes an O(n) performance issue in the inotify backend using a cache, which makes KDirWatch *better* suited for applications that use KDirWatch heavily. Clearly a step in the right direction. Yes a cache needs memory, like all caches, how else is one going to optimize linear searches [in unsorted data]...

I looked at whether other backends have a similar linear search, and only KDirWatchPrivate::checkFAMEvent has something similar, so FAM could use a cache where the key would be the "request number" as obtained with FAMREQUEST_GETREQNUM. That's a different patch though. The other backends don't have such a linear search, why are we even talking of "doing the same for the QFSW backend"? That's not applicable. How about taking a look at the code before requesting to optimize linear searches that don't exist?

Note that many other functions iterate over all entries in m_mapEntries:

void KDirWatchPrivate::removeEntries(KDirWatch *instance)
void KDirWatchPrivate::stopScan(KDirWatch *instance)
void KDirWatchPrivate::startScan(KDirWatch *instance, bool notify, bool skippedToo)
void KDirWatchPrivate::resetList(KDirWatch *instance, bool skippedToo)
void KDirWatchPrivate::slotRescan() // twice even!
void KDirWatchPrivate::famEventReceived()
void KDirWatchPrivate::checkFAMEvent(FAMEvent *fe)
void KDirWatchPrivate::statistics()

This is why I claim that more situations can trigger high CPU load when the list is large.
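For checkFAMEvent specifically, the cache dfaure suggested would be keyed by the FAM request number (the real API obtains it via the FAMREQUEST_GETREQNUM macro). A hypothetical sketch, with mock types standing in for both the FAM event and Entry:

```cpp
#include <cassert>
#include <unordered_map>

// Mock stand-ins; real code would use FAMEvent and read the request
// number with FAMREQUEST_GETREQNUM. Nothing here is KDirWatch's API.
struct Entry { int famReqNum = -1; };
struct MockFamEvent { int reqnum; };

// Hypothetical cache keyed by FAM request number, mirroring the
// wd -> Entry cache the inotify backend gained in this patch.
std::unordered_map<int, Entry*> reqNumToEntry;

// O(1) average-case lookup instead of the linear walk over m_mapEntries
// that checkFAMEvent currently performs.
Entry* entryForFamEvent(const MockFamEvent& fe) {
    auto it = reqNumToEntry.find(fe.reqnum);
    return it != reqNumToEntry.end() ? it->second : nullptr;
}
```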

This revision was automatically updated to reflect the committed changes.