Always use a job to delete files to avoid freezing process waiting on IO
AbandonedPublic

Authored by meven on Feb 20 2018, 10:55 PM.

Details

Reviewers
dfaure
ngraham
jtamate
Group Reviewers
Frameworks
Dolphin
Summary

BUG: 390748
FIXED-IN: 5.44

Remove the QFile.remove fast-path used when a file is local to prevent the calling thread to freeze waiting on the IO.

I wonder if I should also change deleteNextDir that has a similar synchronous fast-path.

The problem at hand is older than 2013 and should affect any application that might delete big files.
I think it might be responsible for more bugs outside of dolphin.

Test Plan
  1. Select a 1 Go file (or any file that you take more than 1 second to delete on your drive)
  2. Hit shift-del
  3. confirm.

Expected Result:

Dolphin UI stays responsive.

Other deletion operations are not affected, a few case to test :

  • Remove a lot of medium filesize files

Diff Detail

Repository
R241 KIO
Branch
bug-390748-make-kio-delete-asynchronous
Lint
No Linters Available
Unit
No Unit Test Coverage
meven created this revision.Feb 20 2018, 10:55 PM
Restricted Application added a project: Frameworks. · View Herald TranscriptFeb 20 2018, 10:55 PM
Restricted Application added a subscriber: Frameworks. · View Herald Transcript
meven requested review of this revision.Feb 20 2018, 10:55 PM
ngraham added a reviewer: Dolphin.

Thanks for the patch! Please remove "bug 390748" from the title and instead add "BUG: 390748" to its own line in the Summary section. See https://community.kde.org/Infrastructure/Phabricator#Formatting_your_patch

meven edited the summary of this revision. (Show Details)Feb 20 2018, 11:05 PM
meven edited the test plan for this revision. (Show Details)
meven added a reviewer: dfaure.
meven edited the summary of this revision. (Show Details)Feb 20 2018, 11:08 PM
meven edited reviewers, added: ngraham; removed: Dolphin.
meven retitled this revision from bug 390748 Always use a job to delete files to avoid freezing process waiting on IO to BUG: 390748 Always use a job to delete files to avoid freezing process waiting on IO.
meven added a reviewer: Dolphin.

Thanks! Still gotta remove "BUG: 390748" from the title, though.

meven retitled this revision from BUG: 390748 Always use a job to delete files to avoid freezing process waiting on IO to Always use a job to delete files to avoid freezing process waiting on IO.Feb 21 2018, 5:34 PM
meven updated this revision to Diff 27716.Feb 21 2018, 5:38 PM

Always use a job to delete files to avoid freezing process waiting on IOX

meven edited the test plan for this revision. (Show Details)Feb 21 2018, 7:02 PM
markg added a subscriber: markg.Feb 21 2018, 10:28 PM

While this might give you the expected result, it feels like a workaround.

I'm assuming the fast path is there for a reason and is really substantially faster then going through the job route.
If that is the case then the proper fix would be to make that code part async. That is obviously much more complex (otherwise it would've been done already).
Think of using std::async and a QEventLoop. Sounds difficult, right? It is :) But I've been playing with that kind of stuff lately so i'm happy to share an example that you can use as a starting point.
Here it is: https://p.sc2.nl/BygE-Oiwz

I wanted to paste it inline, but that already got quite big so a link it is.
I've added a bunch of comments in the code to explains what it's doing.
Note that the example does make a "QEventLoop", you should not do that within the if statement, but rather outside the while loop and simply call exec() and quit() every time (not making a new QEventLoop for every delete)

Lastly, please benchmark this fast pats (as it currently is) compared to your KIO version and my async version to see if the fast path really is the fast path. As we just don't know and that kinda influences which route to choose here.

meven edited the summary of this revision. (Show Details)Feb 23 2018, 6:33 PM
jtamate requested changes to this revision.Feb 23 2018, 7:35 PM
jtamate added a subscriber: jtamate.

My guess is that the problem is not in the direct delete of local files.
In fact, unlink() in most of the modern filesystems is not affected by the size of the file and is quite fast.

Another way to reproduce this bug:

  • Create a directory
  • Create 50.000 files of 2 bytes each one, for example with: "for i in seq -w 1 50000; do dd if=/dev/urandom of=file-$i.go bs=1 count=2; done"
  • Go to that directory, select all files and shift-supr them.
  • Confirm the dialog with the list of files to delete.

Wait too much for the task to start.
And after the notification of the work done, wait again for dolphin to become responsive.

The same problem affects the rename task in such directory,

This revision now requires changes to proceed.Feb 23 2018, 7:35 PM
meven added a comment.Feb 24 2018, 4:29 PM

While this might give you the expected result, it feels like a workaround.

I'm assuming the fast path is there for a reason and is really substantially faster then going through the job route.
If that is the case then the proper fix would be to make that code part async. That is obviously much more complex (otherwise it would've been done already).
Think of using std::async and a QEventLoop. Sounds difficult, right? It is :) But I've been playing with that kind of stuff lately so i'm happy to share an example that you can use as a starting point.
Here it is: https://p.sc2.nl/BygE-Oiwz

I wanted to paste it inline, but that already got quite big so a link it is.
I've added a bunch of comments in the code to explains what it's doing.
Note that the example does make a "QEventLoop", you should not do that within the if statement, but rather outside the while loop and simply call exec() and quit() every time (not making a new QEventLoop for every delete)

Lastly, please benchmark this fast pats (as it currently is) compared to your KIO version and my async version to see if the fast path really is the fast path. As we just don't know and that kinda influences which route to choose here.

Please correct if I am wrong but kio::file_delete will, in the end, calls FileProtocol::deleteRecursive(const QString &path) which works the same way using QFile::remove as before.
So I have the hypothesis that they should not be much difference in performance if any.

It could be impactful because of the KIO::file_delete call and I don't know precisely how expensive such a kio call is.
Or because they are more jobs class instanciated.
But those KIO::file_* calls are already used in a lot of cases in CopyJobPrivate::copyNextFile for instance to recursively copy files, so I know that at least in some cases it is ok.

Please correct me wherever I am wrong I am new in the KIO codebase. I am learning as I go.

The main drawback in this version of the patch currently, is that we loose progressive status update when a lot of files are being removed (slotReport was called each 300 deletion before).
And since needs to be fixed this at least.

markg added a comment.Feb 24 2018, 4:44 PM

While this might give you the expected result, it feels like a workaround.

I'm assuming the fast path is there for a reason and is really substantially faster then going through the job route.
If that is the case then the proper fix would be to make that code part async. That is obviously much more complex (otherwise it would've been done already).
Think of using std::async and a QEventLoop. Sounds difficult, right? It is :) But I've been playing with that kind of stuff lately so i'm happy to share an example that you can use as a starting point.
Here it is: https://p.sc2.nl/BygE-Oiwz

I wanted to paste it inline, but that already got quite big so a link it is.
I've added a bunch of comments in the code to explains what it's doing.
Note that the example does make a "QEventLoop", you should not do that within the if statement, but rather outside the while loop and simply call exec() and quit() every time (not making a new QEventLoop for every delete)

Lastly, please benchmark this fast pats (as it currently is) compared to your KIO version and my async version to see if the fast path really is the fast path. As we just don't know and that kinda influences which route to choose here.

Please correct if I am wrong but kio::file_delete will, in the end, calls FileProtocol::deleteRecursive(const QString &path) which works the same way using QFile::remove as before.
So I have the hypothesis that they should not be much difference in performance if any.

It could be impactful because of the KIO::file_delete call and I don't know precisely how expensive such a kio call is.
Or because they are more jobs class instanciated.
But those KIO::file_* calls are already used in a lot of cases in CopyJobPrivate::copyNextFile for instance to recursively copy files, so I know that at least in some cases it is ok.

Please correct me wherever I am wrong I am new in the KIO codebase. I am learning as I go.

The main drawback in this version of the patch currently, is that we loose progressive status update when a lot of files are being removed (slotReport was called each 300 deletion before).
And since needs to be fixed this at least.

This is spot on: ..."in the end, calls FileProtocol::deleteRecursive(const QString &path) "...
But it takes some time to get there.
KIO is process based. It opens a file slave and feeds it instructions via a socket connection. That architecture is why it doesn't block.
This is heavily oversimplified, but that's roughly how it works.

This communication (and subsequent parsing of the messages) just takes "time". Not a whole lot and it's quite efficient, but it starts to count with loads of files.
I don't know how much this time is, you'd have to benchmark it to know. I'm guessing it's quite substantial otherwise the fast path wouldn't be there.

As for losing progression status... I think that's a no-go. Users kinda lake to know how far something is.

dfaure requested changes to this revision.Feb 25 2018, 6:23 PM

"unlink() in most of the modern filesystems is not affected by the size of the file" doesn't match my experience, I have seen konqueror/dolphin freeze for 10s while deleting a 8GB file (on a somewhat old system, no SSD). And that would actually be the reason for this patch to go in. But on the other hand, I have a hard time believing that this patch doesn't make things slower for the case of many small files, due to the communication overhead with the kioslave (and that's the reason I wrote this code in the first place).

Maybe the right solution is to use QFile::remove if the file is small, and use the kioslave if the file is big. But finding the file size in the first place takes a little bit of time too :-)

markg added a comment.Feb 25 2018, 6:42 PM

"unlink() in most of the modern filesystems is not affected by the size of the file" doesn't match my experience, I have seen konqueror/dolphin freeze for 10s while deleting a 8GB file (on a somewhat old system, no SSD). And that would actually be the reason for this patch to go in. But on the other hand, I have a hard time believing that this patch doesn't make things slower for the case of many small files, due to the communication overhead with the kioslave (and that's the reason I wrote this code in the first place).

Maybe the right solution is to use QFile::remove if the file is small, and use the kioslave if the file is big. But finding the file size in the first place takes a little bit of time too :-)

Note that the issue here is the blocking part which @meven tried to solve :)
The performance impact this would potentially have is merely the point i happen to notice.

But the blocking issue remains, also with your suggestion of QFile::remove.
An alternative approach (that does not involve std::async) is to pre-scan the list of files for local files and send them all at once to the kioslave, just as a list of files to be deleted. A downside in that approach would be the requirement to change the slave as well to handle this.

meven added a comment.EditedApr 7 2018, 8:49 AM

In term of comparison to the FileCopyJob for instance, the difference I can spot with the DeleteJob implementation, is that the FileCopyJobPrivate itselft instanciates a subJob in this case a DirectCopyJob that will do the work, using internally SimpleJobPrivate.
This internal SimpleJobPrivate is executed in a slave thread as I understand, allowing the UI thread to run during the file copy.

I guess the right way to fix that would be to refactor the delete operation in the same manner: using a SimpleJob to benefit from the slave thread, the Job API provides.

It is pretty much already what the current patch does.

dfaure added a comment.Apr 7 2018, 7:49 PM

s/thread/process/. It's not about threads it's about two processes: the (GUI) application, and the kioslave.

But you present all this in a rather convoluted way, it's much simpler. The SimpleJob for asking a kioslave to delete a file is KIO::file_delete which exactly what this patch ends up calling.

The issue is choosing when to call unlink directly and when to ask the slave to do it. Or indeed markg's idea of sending the full list of files to the slave in one go, with a new MultiDeleteJob (including progress information). However for many small files this might be slower (that long list needs to be sent over...).

Since the bug report is about the "one big file" case, I'd say let's not fix what ain't broken (the many small files case), and let's just do "if local and small, delete in-process, otherwise use kioslave", i.e. just adding a size check for the fast path.

The optimized way to do it would be, rather than a stat() in this method, to change DeleteJobPrivate::slotEntries so that it puts big files into a separate list, for instance.

This freezing process happens for me in ktorrent the first time in a day I delete a file, but not the files after, even if they are iso images (>4GiB).
Are we sure the problem is here and not in the notification step, for example loading the sound data and starting the sound processor?
Does the freezing process happens if you delete a second big file?

This freezing process happens for me in ktorrent the first time in a day I delete a file, but not the files after, even if they are iso images (>4GiB).
Are we sure the problem is here and not in the notification step, for example loading the sound data and starting the sound processor?
Does the freezing process happens if you delete a second big file?

The current state is that it freezes as long as the file deletion (unlink) lasts.
So the behavior you observe could only be due to varying IO performance of your system. Deleting files, even big files can be fast or slow depending on a lot of parameters : filesystem usage, repartition of the file on the drive : if it is spread out on a spinning disk then it is slow ...

The current state is that it freezes as long as the file deletion (unlink) lasts.

Ok, let's go for it.

Another use case when this could be useful is when the filesystem is slow, one possible way to know if this is the case is using something like

bool KFileItemPrivate::isSlow() const
meven added a comment.EditedMay 11 2018, 11:10 AM

s/thread/process/. It's not about threads it's about two processes: the (GUI) application, and the kioslave.

But you present all this in a rather convoluted way, it's much simpler. The SimpleJob for asking a kioslave to delete a file is KIO::file_delete which exactly what this patch ends up calling.

The issue is choosing when to call unlink directly and when to ask the slave to do it. Or indeed markg's idea of sending the full list of files to the slave in one go, with a new MultiDeleteJob (including progress information). However for many small files this might be slower (that long list needs to be sent over...).

Since the bug report is about the "one big file" case, I'd say let's not fix what ain't broken (the many small files case), and let's just do "if local and small, delete in-process, otherwise use kioslave", i.e. just adding a size check for the fast path.

The optimized way to do it would be, rather than a stat() in this method, to change DeleteJobPrivate::slotEntries so that it puts big files into a separate list, for instance.

I have tried it and well DeleteJobPrivate::slotEntries is not sufficient, this function is used only when a directory is removed and not when files are removed directly.
So this would require to add more complexity and an added stat call to the chain of calls.

I plan to do some benchmarking to evaluate the impact of the current patch.
I still feel the current patch is good because it is consistent with other Jobs (copyJob for instance) and keeps the code clean.

So before trying adding some special cases adding more complexity to preserve marginaly performance in corner-use cases (load of files), I'd rather study the current patch.
As a side note all other jobs have the same issue regarding load of calls to kio but are not blocking the calling thread (except the rename job apparently).

To me the bug looks more like a usability concern : is kio asynchrounous and usable for UIs or not ?

I am working on a benchmark script, with and without the current patch using kioclient contained in kde-cli-tools.
I will share the script and results, of course.

Restricted Application added a subscriber: kde-frameworks-devel. · View Herald TranscriptMay 11 2018, 11:10 AM
meven added a comment.Jun 3 2018, 9:41 AM

Here is the script I have been using : https://gist.github.com/meven/f0b2a36c61240e1d6e19753afd1d3d68

My benchmark logic is :
Create a folder with x files of k sizes.
Copy this folder.
Delete this folder using kfmclient, measuring the elapsed time

I have used 100000 files of 1 byte, 100000 of 1kb, 1000 of 3mb, 3 of 1Gb.

I ran this on a 2TB hard disc drive.

Here are the very limited results:

In seconds
                			avg
before
1b		1,586	1,738	1,684	1,66933333333333
1k	        1,719	1,777	1,649	1,715
3mb	        9,206	9,164	8,419	8,92966666666667
1gb	        30,736	20,559	29,981	27,092

after
1b 		4,637	1,721	1,599	2,65233333333333
1k	        32,186	1,726	1,685	11,8656666666667
3mb	        2,491	7,287	7,896	5,89133333333333
1gb	        1,464	13,344	17,271	10,693

It appears my benchmark methodology is mostly of no use due to huge outliners values.
I am using kfmclient, but more than time I would need to measure the memory overhead also of using the kioslave instead of the fast-path but this could be tricky with cross-process execution.

Also I think that if my patch was to get through we nay need to treat as a signature change: the behavior of the KIO::DeleteJob would change quite a bit from being most of the time synchronous to being asynchronous.
Some App may have built on the assumption (knowingly or not) that the function only returns after the file(s) have been removed, which would not the case after this patch.
An opt-in boolean option could be needed to trigger the new behavior while keeping the old one for applications that have been updated/reviewed yet and perhaps mark as deprecated the old behavior.

Also I would like to take the time to add some tests, although I need to learn about how to write some and a .

I would much apprieciate feedback.
As the solution is not obvious and still in debatable to me, we could set up some IRC meeting. I will be hanging on #kde-fm.

markg added a comment.Jun 3 2018, 1:55 PM

Here is the script I have been using : https://gist.github.com/meven/f0b2a36c61240e1d6e19753afd1d3d68

My benchmark logic is :
Create a folder with x files of k sizes.
Copy this folder.
Delete this folder using kfmclient, measuring the elapsed time

I have used 100000 files of 1 byte, 100000 of 1kb, 1000 of 3mb, 3 of 1Gb.

I ran this on a 2TB hard disc drive.

Here are the very limited results:

In seconds
                			avg
before
1b		1,586	1,738	1,684	1,66933333333333
1k	        1,719	1,777	1,649	1,715
3mb	        9,206	9,164	8,419	8,92966666666667
1gb	        30,736	20,559	29,981	27,092

after
1b 		4,637	1,721	1,599	2,65233333333333
1k	        32,186	1,726	1,685	11,8656666666667
3mb	        2,491	7,287	7,896	5,89133333333333
1gb	        1,464	13,344	17,271	10,693

It appears my benchmark methodology is mostly of no use due to huge outliners values.
I am using kfmclient, but more than time I would need to measure the memory overhead also of using the kioslave instead of the fast-path but this could be tricky with cross-process execution.

Also I think that if my patch was to get through we nay need to treat as a signature change: the behavior of the KIO::DeleteJob would change quite a bit from being most of the time synchronous to being asynchronous.
Some App may have built on the assumption (knowingly or not) that the function only returns after the file(s) have been removed, which would not the case after this patch.
An opt-in boolean option could be needed to trigger the new behavior while keeping the old one for applications that have been updated/reviewed yet and perhaps mark as deprecated the old behavior.

Also I would like to take the time to add some tests, although I need to learn about how to write some and a .

I would much apprieciate feedback.
As the solution is not obvious and still in debatable to me, we could set up some IRC meeting. I will be hanging on #kde-fm.

That downside is not a solution.
Specifically with large folders, you'd really like to see some form of progression as it's being deleted. I would.

Please take a good close look at my updated example of how to make this async:
https://p.sc2.nl/ByvSb_-lQ
I've added comments everywhere in there to explain how the code works and what needs to be done to use it in here.
Just run it (you need C++14 and Qt) to see how it looks. (my qmake file incase you want it: https://p.sc2.nl/Hy25fO-xm)

It keeps it on the client process (the performance optimization of David Faure) but offloads it to a different thread.
I've extended the example to show the number of files that have been deleted every 300ms.
That in it's own is slightly different to what the code currently does. Currently it calls slotReport after every 300 files. With this it would call slotReport after every 300ms. I don't think that's much of a problem.

I can try and make a KIO patch for this if you want. As i now have the logic in that example code. (i just don't have a working frameworks dev environment at the moment)

meven added a comment.Jun 3 2018, 2:46 PM

Great suggestion Mark !

I am a C++ beginner, I did not consider this neat C++ 14 feature.

This will necessitate a c++ compiler dependency change though.
Like Kwin did last July https://github.com/KDE/kwin/commit/ea5d611de1bc33869c13c27d75a7827201a5139d

That in it's own is slightly different to what the code currently does. Currently it calls slotReport after every 300 files. With this it would call slotReport after every 300ms. I don't think that's much of a problem.

I think a time based update would make more sense to the user.

I think deleteFiles and deleteDirs should both be wrapped in the async function.
Otherwise, at best we would end up with multiple parallel file deletion which is not preferable (given current filesystems and hardware, we should favor sequential deletion) and at worst the same as today blocking the main thread.
Or we might need some mutex/buffer to synchronize the unlink syscalls through Qt::remove() between different async deletion functions.

So this plus the added necessary synchronizing code, this might end up a big code change.

I will give a spin.

markg added a comment.Jun 3 2018, 3:28 PM

Great suggestion Mark !

I am a C++ beginner, I did not consider this neat C++ 14 feature.

This will necessitate a c++ compiler dependency change though.
Like Kwin did last July https://github.com/KDE/kwin/commit/ea5d611de1bc33869c13c27d75a7827201a5139d

Well, it's nearly all C++11 :)
Only the "300ms" (specifically the "ms") in there is new in C++14 (chrono literals: https://en.cppreference.com/w/cpp/chrono/duration). In this case removing the ms would make it C++11 and probably work (i haven't tested that).
You could make it all in Qt (QThread, QFuture, ...). I'm just not that used to those classes en went for the STL ones instead.

That in it's own is slightly different to what the code currently does. Currently it calls slotReport after every 300 files. With this it would call slotReport after every 300ms. I don't think that's much of a problem.

I think a time based update would make more sense to the user.

I think deleteFiles and deleteDirs should both be wrapped in the async function.
Otherwise, at best we would end up with multiple parallel file deletion which is not preferable (given current filesystems and hardware, we should favor sequential deletion) and at worst the same as today blocking the main thread.
Or we might need some mutex/buffer to synchronize the unlink syscalls through Qt::remove() between different async deletion functions.

So this plus the added necessary synchronizing code, this might end up a big code change.

Take care with modifying those class members though. You'd probably want to protect them with a mutex. Even though this code probably won't cause a race condition, better be safe then sorry :)

I will give a spin.

meven abandoned this revision.Nov 12 2019, 12:33 PM

Abandoned in favor of D24962