KDevelop (full) project parsing: defer until all projects have been loaded.
AbandonedPublic

Authored by rjvbb on Sep 8 2017, 11:17 PM.

Details

Reviewers
kfunk
Group Reviewers
KDevelop
Summary

KDevelop currently starts the project parse run when the project has been imported. That's perfect when using single-project sessions, but it means that the import of each subsequent project has to compete with the parser. Project import is a process that can block the UI and should thus take as little time as possible.

This patch introduces a simple queue that holds all newly imported projects until all the imports have been completed; they're handed off to the parser only once the actual import has completed.

Test Plan

Works as intended in 5.1.2; patch applied cleanly to the 5.2 branch.

This is a change I've been thinking about for a long time; the initial impression is of improved responsiveness (even with "parse all project files" turned off).

I didn't see a need to make m_pendingParseProjectJobs a FIFO or FILO, is there?

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped
rjvbb created this revision.Sep 8 2017, 11:17 PM
Restricted Application added a subscriber: kdevelop-devel. · View Herald TranscriptSep 8 2017, 11:17 PM
rjvbb updated this revision to Diff 19328.Sep 8 2017, 11:17 PM

Forgot to remove some debug output.

rjvbb set the repository for this revision to R32 KDevelop.Sep 8 2017, 11:18 PM
brauch added a subscriber: brauch.Sep 8 2017, 11:40 PM

Do you have some timing on how much loading time improves with this patch?

arrowd added a subscriber: arrowd.Sep 9 2017, 6:58 AM

but it means that the import of each subsequent project has to compete with the parser.

Can you please elaborate on this? I didn't get how importing distinct projects can affect each other.

rjvbb added a comment.Sep 9 2017, 9:02 AM
Can you please elaborate on this? I didn't get how importing distinct projects can affect each other.

Once the first project of a series has been imported completely it starts to parse. That means a lot of file I/O and that's the level where the competition occurs. The parser can be quite RAM hungry so there may also be competition at that level.

I haven't yet had the opportunity to do timing comparisons and expect they'd be tricky to get. What I can try to do is start an elapsed timer when the 1st project starts importing and print its output via qCDebug when the last project has been imported. Combined with an env. var check to go back to the non-deferred start. That way you can get an idea yourself with minimal effort.

Do any of you ever stop the parser because it's just not justified to let it burn CPU going through all those (unchanged) files again? With this mod it should become a bit easier to do that at the earliest possible moment.

FWIW, my initial idea was to use the Background Parser/Delay value here too but that appears to require additional and unjustified complexity.

arrowd added a comment.Sep 9 2017, 9:54 AM
In D7745#144194, @rjvbb wrote:
Can you please elaborate on this? I didn't get how importing distinct projects can affect each other.

Once the first project of a series has been imported completely it starts to parse. That means a lot of file I/O and that's the level where the competition occurs. The parser can be quite RAM hungry so there may also be competition at that level.

Ok I get it. FWIW, Visual Studio does exactly this - it first loads all the projects and only then starts parsing. Now that KDevelop got merged with KDevPlatform, the project count in the generated solution reached ~500. Loading such solution takes minutes on my machine, so I imagine it would take really longer if there would be a parser running in the background.

rjvbb abandoned this revision.Sep 9 2017, 5:19 PM

Merged with the companion revision https://phabricator.kde.org/D7742

rjvbb added a comment.Sep 16 2017, 8:30 AM

Now that KDevelop got merged with KDevPlatform, the project count in the generated solution reached ~500. Loading such solution takes minutes on my machine, so I imagine it would take really longer if there would be a parser running in the background.

It seems that using cmake server mode is also to blame for long project import times. On my Linux rig the same project (the 5.2 branch of kdevelop itself) takes over 4x longer to import in a build of that 5.2 branch vs. an otherwise identical build of 5.1/head. Specifically, the pure import phase (+ parsing 10 or so open files; creating dirwatchers deferred and full project parsing turned off) takes about 1 minute in v5.1.40 and 12-14 seconds in v5.1.2.5 .

I hope that whatever benefits come from using cmake server mode justify that cost. If I interpret the code correctly there still is a fallback to the old import method for cmake versions that don't support server mode.

If there's no way to speed up server-mode import (for instance by reducing the sh*load of messages that cmake sends that are being ignored) then maybe we should think of a way to bypass it. It's one thing if you have to wait "a bit" longer for a project to load if you're going to be doing serious development on it and can thus use all extra metadata (or whatever) possible. It's something else altogether having to wait that much longer if you just want to open open for a quick peek while working on another project.

Maybe a "Quick Import" feature could make sense (action in the Project menu and/or a checkbox in the open file dialog when starting an import), there might be other projectmanagers that have a mode where they take significantly less time?

R.

I just poked around a bit myself. It indeed takes ages to import the project, and I couldn't figure out why ... it is certainly not CPU load. I looked in perf, the CPU time spent in importing the project is quite low.
In fact, the code model generation (which seems to be the last step in the import AFAIU) finishes relatively quickly here (after something like 5 seconds), but then it still takes like 30 seconds for the import to finish. I don't know where that time is spent.

This is a bug, something is waiting too long for something else. Let's find the bug instead of introducing checkboxes to work around it.

Ok, here's the two things that are broken and need fixing:

  • FileManagerListJob has a queue of directories to list. It processes those sequentially, and after listing each, it goes through the event loop (async slot invocation) to list the next. This takes aaaaaages for lots of directories to list, which are mostly spent waiting and not in actual CPU time. When doing this synchronously as a test, it takes maybe half a second instead of 30 seconds.
  • The reason this is noticeable now is that for some reason, it lists all the build dirs. This is probably a bug in the new cmake importer, they should not be listed.

Rene, maybe you want to look into this? Fixing the second point would solve the problem, but fixing the first one somehow (just making it sync isn't a solution, but there is certainly a better way) would also be very nice since it will further speed up the import.

Greetings!

Or rather, let me rephrase: The first point is IMO the much worse bug, since that is what will cause large project imports (think clang) be very slow in either case. For completely no reason whatsoever.

Rene, maybe you want to look into this?

This sounds like it could well be something requiring a substantially deeper understanding of the underlying architecture than I have (and wouldn't normally seek out just for fun ;))

" When doing this synchronously as a test, it takes maybe half a second instead of 30 seconds."

This sounds like something that's independent of whether or not cmake server mode is used. If so, where does the 4x cmake server mode overhead I'm seeing come from?

"it lists all the build dirs"

Who/what is "it" here? Can you confirm (or check) if this also happens for an out-of-tree build dir?

see comments on https://phabricator.kde.org/D7745 for why it's actually slow

Which ones?

As to not deferring: I continue to think it makes sense and will also make it a lot easier to isolate performance issues in the future, if/when we have addressed the current ones. As far as my timing results indicate (in the 5.1 branch), deferring doesn't change the total import time significantly under optimum conditions. It wasn't actually meant to do that either, the main goal was to hog the main event loop for as little time as possible during an import.
As to making dirwatching optional: I agree the ideal solution would be to fix the feature (to remove the argument "don't waste cycles on something when it doesn't work anyway") but I'm not convinced it can be improved to the point where it takes a negligible amount of time on all platforms. Need I repeat that the main issue that triggered this diff was the fact I had to wait for minutes with the main thread running at 100% CPU outside the event loop?

Seriously, how many of us use KDevelop to work on projects the size of GCC or llvm and need dirwatching on the whole tree? How expensive is that on Linux, *after* the initial import?

Maybe the proposed checkbox to en/disable dirwatching should be coupled with a project-level per-directory setting, "watch this directory for changes", hidden simply (?) in the directory context menu and/or the "project filter" settings.

Come to think of it, this is really something that can be compared to mail folder synchronisation. In KMail you can indicate for each folder if it has to be included in the scheduled email check (the version I use supports IMAP IDLE only on the inbox), but a sync will be performed each time to select a folder. In KDevelop this could be coupled to directory expansion in the project manager. I don't know if that's doable API-wise, but I like the idea, it sounds a lot quicker than doing a right-click and selecting the "Reload" action.

R.

In D7745#146271, @rjvbb wrote:
Rene, maybe you want to look into this?

This sounds like it could well be something requiring a substantially deeper understanding of the underlying architecture than I have (and wouldn't normally seek out just for fun ;))

Well, you gain understanding in the process, don't you ;)
I didn't understand it either before I looked at it.

" When doing this synchronously as a test, it takes maybe half a second instead of 30 seconds."

This sounds like something that's independent of whether or not cmake server mode is used. If so, where does the 4x cmake server mode overhead I'm seeing come from?

cmake server mode seems to (wrongly) include the build dirs, at least here.

"it lists all the build dirs"

Who/what is "it" here? Can you confirm (or check) if this also happens for an out-of-tree build dir?

It probably doesn't happen. 'It' = the cmake project plugin which decides which folders are part of the project being opened.

rjvbb added a comment.Sep 16 2017, 2:12 PM
Well, you gain understanding in the process, don't you ;)
I didn't understand it either before I looked at it.

Heh, like the understanding of boot making I gained last year or so. Doesn't mean I'm now able to make or completely re-sole them myself (or interested in doing so) ;)

> Who/what is "it" here? Can you confirm (or check) if this also happens for an out-of-tree build dir?
It probably doesn't happen. 'It' = the cmake project plugin which decides which folders are part of the project being opened.

It could be useful to be sure as part of an answer to the question where the plugin gets the information about the build dir from. So if you have the possibility to rerun those tests you mentioned with a project configured to use an out-of-source build directory, please do. It will be a few days before I'll be able to delve this deep into the cmake plugin anyway.

In D7745#146271, @rjvbb wrote:

Seriously, how many of us use KDevelop to work on projects the size of GCC or llvm and need dirwatching on the whole tree? How expensive is that on Linux, *after* the initial import?

Watching directories is quite essential when working with Git or other SCMs. When checking out another branch, I want KDevelop's file tree to be updated as well.

I think the main problem here is that KDirWatch uses vastly different implementations. As it says in kdirwatch.h:

The implementation uses the INOTIFY functionality on LINUX. Otherwise the FAM service is used, when available. As a last resort, a regular polling for change of modification times is done; the polling interval is a global config option: DirWatch/PollInterval and DirWatch/NFSPollInterval for NFS mounted directories.

I guess inotify is quite efficient as an API for exactly that purpose, while the fallback methods are not. So maybe it makes sense to have some size restriction based on the method used in KDirWatch, which can be obtained via KDirWatch::internalMethod(). Of course it takes some time to open a large file tree even on Linux, but no longer than other IDEs I know of.

Maybe the proposed checkbox to en/disable dirwatching should be coupled with a project-level per-directory setting, "watch this directory for changes", hidden simply (?) in the directory context menu and/or the "project filter" settings.

Come to think of it, this is really something that can be compared to mail folder synchronisation. In KMail you can indicate for each folder if it has to be included in the scheduled email check (the version I use supports IMAP IDLE only on the inbox), but a sync will be performed each time to select a folder. In KDevelop this could be coupled to directory expansion in the project manager. I don't know if that's doable API-wise, but I like the idea, it sounds a lot quicker than doing a right-click and selecting the "Reload" action.

That works if your main method to open files is to use the project manager. However, some people prefer opening files via the Quick Open bar. In that case you want to watch the entire tree, otherwise new files might be invisible for a long time.

The difference between IMAP and our situation is that in the former case there is a network connection (usually WAN), while in the latter case all files are typically local.

rjvbb added a comment.Sep 17 2017, 2:46 PM
Watching directories is quite essential when working with Git or other SCMs. When checking out another branch, I want KDevelop's file tree to be updated as well.

My proposition doesn't make that impossible. It just makes it optional.

Of course that option could be hidden/omitted on platforms where it doesn't make sense. I have no issue with that, but didn't propose it because I know many do have a grip with platform #ifdefs. I could of course disable and hide the checkbox on Linux if KDirWatch uses inotify. I don't really want to go the direction of doing this as a function of project size: too machine-specific and personal (and thus random).

Unless we replace the checkbox with a file limit (where 0 would mean "always off")?

That works if your main method to open files is to use the project manager. However, some people prefer opening files via the //Quick Open// bar. In that case you want to watch the entire tree, otherwise new files might be invisible for a long time.

I work mainly on a platform where dirwatching doesn't work anyway, and I get by. It's not a big deal to do a (partial) reload of the project tree. Annoying sometimes, but I don't have any other choice anyway.
(Reminder, dirwatching did work in KDevelop4 but ate file descriptors so had to be limited.)

The difference between IMAP and our situation is that in the former case there is a network connection (usually WAN), while in the latter case all files are typically local.

Yes, the cost of continuous syncing is in different things, but it can still be a cost that is best avoided.

Does anyone know what the other costs are (on Linux), not in terms of project import duration but runtime costs that add up to become significant above a certain project size (memory, interrupts, file descriptors, gnomes, moonrays, whatever)?

mwolff added a subscriber: mwolff.Sep 18 2017, 1:13 PM

The difference you describe between cmake server mode vs. no server mode once again shows that you have not yet fully grasped what's going on here and that this patch as it stands is not the right fix. Find out why the cmake server mode degrades the performance, than fix that. You say:

Does anyone know what the other costs are (on Linux), not in terms of project import duration but runtime costs that add up to become significant above a certain project size (memory, interrupts, file descriptors, gnomes, moonrays, whatever)?

No, measure it. Use a profiler like perf or hotspot (or instruments) and figure it out.

rjvbb added a comment.Sep 18 2017, 2:42 PM

Sorry, but no, I think it's you who are following things diagonally. The extra overhead due to cmake server mode is orthogonal to the overhead due project parsing and installing dirwatchers, or so it should be. Import time increase due to concurrent parsing and dirwatcher creation can (will) happen with all project types. Deferring those features to after project import can only make it easier to understand what causes the cmake server mode overhead, but that's not something I am particularly interested in (so you're right about my grasp of it).

I'll say it again, deferring the start of features that work in the background just makes sense. It's highly unlikely that the project tree will change outside of KDevelop while being imported, and it is not the parsing of the initially opened document that's being delayed. The only effect of deferring a user might (will) notice is that project opening is a bit snappier, i.e. s/he can start to work a bit sooner. That should be enough justification for a change that really isn't that complicated.

No, measure it.

Happy to oblige - if someone pays me for the investment that will require.

I have a rough but largely sufficient measurement of the possible cost of dirwatchers on Mac: 15 minutes extra import time. That makes any concerns about the runtime costs moot, esp. since dirwatching doesn't work anyway, or hardly.
With the projects I can tackly on my Linux system the import overhead is negligible and other runtime costs in terms of things I can measure hidden in the overall noise.

I'll say this again too: deferring dirwatch creation makes as much sense on Linux as elsewhere but I'm fine with #ifdeffing the code making dirwatching optional. Or disabling it on Mac until someone figures out how to get it to work reliably.

Rene, did you actually read the comments I made on the other RR? I'm fairly sure that is the result an in-depth performance analysis will show too, and I'm faily sure your workaround is not needed if the performance *bugs* this code has are fixed ...

rjvbb added a comment.Sep 18 2017, 7:28 PM
Rene, did you actually read the comments I made on the other RR?

Yes, I did. I didn't re-read them since, I'll do that when I get around to doing something with your suggestions (which will be with deferral in place).

I'm fairly sure that is the result an in-depth performance analysis will show too, and I'm faily sure your workaround is not needed if the performance *bugs* this code has are fixed ...

Which workaround, deferral or disabling dirwatching?

Even if your hunch is right "not needed" still doesn't mean "not a good idea". It's already been pointed out that MSVC also defers parsing until after all projects are loaded - it may not be the best IDE around but I've never found it slow (once the app itself loaded) and its parser was hard to fault.

The deferred starting of the parsing. The dirwatches, sure -- if they are that slow on mac, another solution is needed.

Just because VS does this doesn't mean we need it too ... maybe they do it because it's simpler to implement in their code ...

rjvbb updated this revision to Diff 19945.Sep 26 2017, 4:28 PM
rjvbb edited the summary of this revision. (Show Details)

I'm reopening this. The patch still contains the code for import timing that I consider temporary and of use only during the review process. I'm aware it complexifies the main change, which isn't very complicated at all.

There may not be a significant benefit to deferring the start of the full project parse under normal circumstances on the overall import duration for all projects. The import times for the 2nd, 3rd etc. project do decrease for obvious reasons, meaning that the session is ready to go a bit sooner.

Deferring a resource-intensive feature also makes work on the efficiency of the preceding import easier.

rjvbb updated this revision to Diff 20446.Oct 7 2017, 3:03 PM

remove projects being closed from the list of projects with pending import finalisation.

kfunk requested changes to this revision.Nov 23 2017, 1:15 PM
kfunk added a subscriber: kfunk.

Just so my dashboard is up-to-date: This Diff contains WIP code => marking with 'Requesting changes'.

This revision now requires changes to proceed.Nov 23 2017, 1:15 PM

looking at this patch - is this still an issue? Note that I fixed the actual underlying bugs recently in master. Try it out again, probably this whole patch can be abandoned.

rjvbb abandoned this revision.Nov 24 2017, 1:27 PM

What underlying bugs and which commits?

It seems no one shares my conviction that background process like parsing should start only once the foreground process is fully ready for and available to the user. That's enough reason to drop my proposition.

From a pure conceptual POV this patch makes sense to me. I can also report the user experience that starting sessions with lots of projects has some short annoying moment, when you have to wait for the projects to appear in the project manager view, so one can finally proceed to the sources one was intending to look at. And Murphy decided it is always the project loaded last which has the files you are for.

And I have an SSD. Would be curious what the user experience is for setups with poor I/O file storage, e.g. with plain old hard discs or network-mounted filesystems. Is there anyone who could report it?

It's a short moment in usage time. But can be on a hot path ("gosh, I have to fix that bug/show that code to my boss/mate, let's open the session"). And the actual code proposed to deal with that is pretty small (not yet really looked at). So +1 from principal POV.

rjvbb added a comment.Nov 24 2017, 3:11 PM
And I have an SSD. Would be curious what the user experience is for setups with poor I/O file storage, e.g. with plain old hard discs or network-mounted filesystems. Is there anyone who could report it?

/me. My main Mac work disk is a (fast...) HDD with significant free space fragmentation, my Linux work disk is an SSHD but uses ZFS (with a low-end CPU). I concur completely with the Murphy reference and I'd add that it's not only always the project of interest that loads last, but that the wait is also always the longest (and not just perceptually) when you want it to be as short as possible.

Project import isn't only I/O bound; IIRC I saw significant CPU being spent on regexps corresponding to the project filters, the few times I profiled import.
Then there's memory: the parser is among what takes up most RAM which doesn't help if your system is already stressed like mine often are (and 8Gb is the minimum I work with).

In D7745#171643, @rjvbb wrote:

What underlying bugs and which commits?

Cf. 0828e3e5c12517e897098dda13d8045385e42c06 and previous patches.

It seems no one shares my conviction that background process like parsing should start only once the foreground process is fully ready for and available to the user. That's enough reason to drop my proposition.

Yes, that's what the code is doing. Well, wasn't, due to bugs, but it's fixed now.

The description of the linked commit indeed reads that way. If you provide snapshot appimage builds off the master branch I can verify if the commits indeed do what I hope they do, otherwise I'm afraid it'll have to wait until they get merged into the 5.2 branch.