BUG:360677 KisUpdaterContext should not kill threads in waitForDone()
ClosedPublic

Authored by asavonic on Mar 17 2016, 10:46 PM.

Details

Summary

The KisUpdaterContext::waitForDone() method must lock the context and wait
until all threads finish their current tasks.
We cannot call QThreadPool::waitForDone() to do this, because after
waiting it removes all threads from the pool.
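The fix can be sketched in plain standard C++ (the names below — MiniContext, jobStarted(), jobFinished() — are illustrative stand-ins, not Krita's actual API, with std::mutex and std::condition_variable playing the roles of QMutex and QWaitCondition): waiting is done on a condition variable tied to a running-jobs counter, so the worker threads survive the wait instead of being removed as QThreadPool::waitForDone() would do.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the fixed KisUpdaterContext, written with the
// standard library instead of Qt. waitForDone() blocks until no job is
// running; it never touches the worker threads themselves, so they stay
// alive in the pool.
class MiniContext {
public:
    void jobStarted() {
        std::lock_guard<std::mutex> g(m_lock);
        ++m_activeJobs;
    }

    void jobFinished() {
        std::lock_guard<std::mutex> g(m_lock);
        if (--m_activeJobs == 0)
            m_allDone.notify_all();   // wake every thread blocked in waitForDone()
    }

    void waitForDone() {
        std::unique_lock<std::mutex> g(m_lock);
        // The predicate guards against spurious wakeups.
        m_allDone.wait(g, [this] { return m_activeJobs == 0; });
    }

private:
    std::mutex m_lock;
    std::condition_variable m_allDone;
    int m_activeJobs = 0;
};
```

A caller that has queued jobs simply calls waitForDone() and is woken when the last job reports completion; no thread is joined or destroyed in the process.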

Test Plan

Verified that no extra threads are created: 8 for the global thread pool and 8 for
KisUpdaterContext.

(gdb) info threads

Id   Target Id         Frame
22   Thread 0x7fff8cff9700 (LWP 18481) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
21   Thread 0x7fff8d7fa700 (LWP 18480) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
20   Thread 0x7fff8dffb700 (LWP 18479) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
19   Thread 0x7fff8e7fc700 (LWP 18478) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
18   Thread 0x7fff8effd700 (LWP 18477) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
17   Thread 0x7fff8f7fe700 (LWP 18476) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
16   Thread 0x7fff8ffff700 (LWP 18475) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
15   Thread 0x7fff9bfff700 (LWP 18474) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
13   Thread 0x7fffb8ff9700 (LWP 18471) "QFileInfoGather" 0x00007fffef6e291f in pthread_cond_wait () from /lib64/libpthread.so.0
12   Thread 0x7fffb97fa700 (LWP 18470) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
11   Thread 0x7fffb9ffb700 (LWP 18469) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
10   Thread 0x7fffba7fc700 (LWP 18468) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
9    Thread 0x7fffbaffd700 (LWP 18467) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
8    Thread 0x7fffbb7fe700 (LWP 18466) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
7    Thread 0x7fffbbfff700 (LWP 18465) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
6    Thread 0x7fffc8f8a700 (LWP 18464) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
5    Thread 0x7fffc978b700 (LWP 18463) "Thread (pooled)" 0x00007fffef6e2cc8 in pthread_cond_timedwait () from /lib64/libpthread.so.0
4    Thread 0x7fffca38d700 (LWP 18462) "KisTileDataSwap" 0x00007fffef6e291f in pthread_cond_wait () from /lib64/libpthread.so.0
3    Thread 0x7fffcab8e700 (LWP 18461) "KisTileDataPool" 0x00007fffef6e291f in pthread_cond_wait () from /lib64/libpthread.so.0
2    Thread 0x7fffe0843700 (LWP 18460) "QXcbEventReader" 0x00007ffff16a33ed in poll () from /lib64/libc.so.6
* 1    Thread 0x7fffe7f3f7c0 (LWP 18456) "krita" 0x00007ffff16a33ed in poll () from /lib64/libc.so.6

Diff Detail

Repository
R37 Krita
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
asavonic updated this revision to Diff 2827. Mar 17 2016, 10:46 PM
asavonic retitled this revision from to BUG:360677 KisUpdaterContext should not kill threads in waitForDone().
asavonic updated this object.
asavonic edited the test plan for this revision. (Show Details)
asavonic added a reviewer: dkazakov.
fazek added a subscriber: fazek. Mar 18 2016, 12:09 PM

Just an idea: perhaps it's not a big problem, but this is a CPU-hungry solution. You could wait for the threads one by one, always waiting for a single thread to finish. Then you could do it with something like pthread_join() or QThread::wait(); I don't know how the threads are implemented.

It seems this patch sometimes causes a deadlock when I try to open a new tab. I'll fix it.

dkazakov requested changes to this revision. Edited Mar 23 2016, 3:36 PM
dkazakov edited edge metadata.

Hi, @asavonic!

Sorry for not replying quickly. The fact that the threads are released when we call waitForDone() is a really good catch! I was always wondering why it happens and could never investigate it properly :)

The busy loop in the patch is really a bad idea: it saturates at least one CPU core and prevents other threads from doing any useful work. There are at least two options for how this problem can be fixed/avoided:

  1. Use a QWaitCondition coupled with QThreadPool::activeThreadCount(), at least if I understood the meaning of activeThreadCount() correctly. You can check KisUpdateScheduler::blockUpdates() for an example of using QWaitCondition.
  2. Another, less intrusive option would be to modify KisUpdateScheduler::isIdle() not to call tryBarrierLock(), but just use activeThreadCount() or something like that. This would be the best option, I guess. We call isIdle() once a second or so, so this is the exact reason why the threads are dying. The other user of barrierLock() is legacy strokes, but they are not so popular.
This revision now requires changes to proceed. Mar 23 2016, 3:36 PM

I'm still working on this. Another implementation with QSemaphore created more problems than it solved.

asavonic updated this revision to Diff 3464. Apr 23 2016, 12:18 AM
asavonic edited edge metadata.
  • Use a condition variable in KisUpdaterContext::waitForDone()
  • Use KisUpdateScheduler::barrierLock() in the KisImage dtor
asavonic updated this revision to Diff 3465. Apr 23 2016, 1:02 AM
asavonic edited edge metadata.
  • Use a condition variable in KisUpdaterContext::waitForDone()
  • Use KisUpdateScheduler::barrierLock() in the KisImage dtor

This is still in progress, got a weird crash in KisImageTest.

ASSERT: "isDetached()" in file /usr/include/qt5/QtCore/qvector.h, line 551
#3  0x00007ffff31b7c7e in qt_assert(char const*, char const*, int) () from /usr/lib64/libQt5Core.so.5
#4  0x00007ffff78c90c5 in QVector<KisUpdateJobItem*>::reallocData (this=0x1e2a798, asize=8, aalloc=8, options=...) at /usr/include/qt5/QtCore/qvector.h:551
#5  0x00007ffff78c89b5 in QVector<KisUpdateJobItem*>::detach (this=0x1e2a798) at /usr/include/qt5/QtCore/qvector.h:355
#6  0x00007ffff78c8449 in QVector<KisUpdateJobItem*>::data (this=0x1e2a798) at /usr/include/qt5/QtCore/qvector.h:122
#7  0x00007ffff78c7ba0 in QVector<KisUpdateJobItem*>::operator[] (this=0x1e2a798, i=0) at /usr/include/qt5/QtCore/qvector.h:402
#8  0x00007ffff78c586c in KisUpdaterContext::waitForDone (this=0x1e2a778) at /home/asavonic/dev/krita/src/krita/libs/image/kis_updater_context.cpp:195
#9  0x00007ffff78e08da in KisUpdateScheduler::barrierLock (this=0x1eb1fd0) at /home/asavonic/dev/krita/src/krita/libs/image/kis_update_scheduler.cpp:337
#10 0x00007ffff79006e0 in KisImage::waitForDone (this=0x7e0080) at /home/asavonic/dev/krita/src/krita/libs/image/kis_image.cc:1161
asavonic updated this revision to Diff 3466. Edited Apr 23 2016, 2:20 AM
  • Use thread-safe iteration over QVector in waitForDone()

The crash above is fixed, ready for review.

fazek added a comment. Apr 23 2016, 4:40 AM

The counters for the threads are a good idea. For me this solution is good if the number of threads and their order cannot change during the wait cycle.

Hi, @asavonic!

Your patch is really nice! I like the idea of making a snapshot of the jobs and checking them on waking up. There is a problem, though. The patch changes the semantics of the waitForDone() call. Originally, waitForDone() in the QThreadPool waited for all the threads to finish. And since each thread started a new job before finishing, waitForDone() could not return before all the jobs of all the strokes and/or updates were finished. This patch tries to solve a slightly different problem: "wait until all the currently running jobs are finished, forget about the new ones". Your problem is actually a bit more complex, and yes, it needs snapshots and all the complications that come with getting them.

Changing this behavior can lead to some unexpected problems in our code, because, e.g., KisToolTransform::slotRestartTransform() and all the unit tests rely on this behavior. Not sure if it is a good idea, but it is what we have.

Will it work if you simplify the solution a bit?

  1. In checkWaitCondition() just check if all the jobs are not running, without using the snapshots.
  2. Use m_lock for waiting on a wait condition. That'll fix the problem of checking the validity of m_jobs.

The "starving" problem for waiting threads is not a problem for us, because in most cases that is exactly what they want :)

libs/image/kis_image.cc
1161 ↗(On Diff #3466)

This must be a misprint :) Either it is debugging code or an unlock() is forgotten.

libs/image/kis_updater_context.cpp
227

(This comment deprecated due to my comment below in the review section)

To have safe access to m_jobs, the context itself should also be locked while doing this iteration. Otherwise the code in KisStrokesQueue::processQueue() and KisSimpleUpdatesQueue::processQueue() will overwrite the contents of m_jobs.

And using Q_FOREACH without locking will cause a crash when more than three threads access the m_jobs object. It happens because of implicit sharing in a temporary object (created inside Q_FOREACH), which will be detached when some other thread changes the original vector. You can check the implementation of KisSafeReadNodeList and FOREACH_SAFE in KisNode for more details about this problem.

Basically, I would suggest just adding the locking; then using Q_FOREACH will become safe.

But take care with the order of locking to avoid deadlocks. Ideally, you'd better avoid holding both locks at the same time. I'm not sure, but it seems like moving m_condWaitLock.lock() below the loop should be OK.
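The locking pattern suggested here can be illustrated with a standard-C++ sketch (g_lock, g_jobs, and JobItem are hypothetical stand-ins for the Krita members mentioned above): the reader holds the same mutex as the writers for the entire iteration, which is what makes the loop safe once the container can no longer be mutated (or, with Qt containers, detached) mid-iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative sketch: iterate a container shared between threads only
// while holding the lock that the writers also take. With Qt containers
// this matters even more, because Q_FOREACH copies the implicitly shared
// container, and a concurrent write forces a detach of data that another
// thread may still be reading.
struct JobItem { bool running = false; };

std::mutex g_lock;              // stands in for the context lock (m_lock)
std::vector<JobItem*> g_jobs;   // stands in for m_jobs

bool allJobsIdle() {
    std::lock_guard<std::mutex> g(g_lock);   // held for the *whole* loop
    return std::all_of(g_jobs.begin(), g_jobs.end(),
                       [](const JobItem *job) { return !job->running; });
}
```

The key design point is the scope of the lock guard: it covers the full traversal, not just each element access, so writers like processQueue() cannot interleave with the iteration.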

asavonic updated this revision to Diff 3563. Edited Apr 29 2016, 8:27 PM
  • Use a semaphore to wait for all threads in KisUpdaterContext

It is also necessary to change the order of signals emitted from the
job: we must emit sigJobFinished to release the semaphore and only after
that emit sigDoSomeUsefulWork.

Otherwise we will get a deadlock:

thread #1 call stack:
  - KisUpdaterContext::startJob // cannot acquire the semaphore
                                // since all threads are stopped
  - KisSimpleUpdateQueue::processQueue // locked the queue
  - KisUpdateScheduler::fullRefreshAsync

thread #2 call stack:
  - KisSimpleUpdateQueue::optimize       // cannot lock the queue
  - KisUpdateScheduler::doSomeUsefulWork
  - KisUpdateJobItem::run
asavonic added a comment. Edited Apr 29 2016, 9:20 PM

Hi, @dkazakov

The patch changes the semantics of the waitForDone() call.

Yeah, I noticed that, quite a lot of code depends on this behaviour.

Will it work if you simplify the solution a bit?
In checkWaitCondition() just check if all the jobs are not running, without using the snapshots.
"Starving" problem for a waiting threads is not a problem for us, because in most of the cases that is exactly what they want :)

This is actually the solution I tried to avoid, because we can possibly wait forever in waitForDone() if we always have some jobs running (even if they are not related).
But I guess you are right: without explicit dependencies between the jobs, it is better to wait for them all to finish.

I've pushed the fix (see the comment above).

And using Q_FOREACH without locking will cause a crash in a case when there is more than three thread

Didn't know this, thank you.

fazek added a comment. Apr 30 2016, 3:28 AM

As far as I can understand, the original solution closes and reopens the threads. This works, it just takes too much time and resources. So a solution that waits for a moment when there is no thread running is good. It's only a hazard for the future; currently nothing creates new jobs indefinitely during the wait.

Hi, @asavonic!

Speaking truly, I like your previous implementation more :)

I tried to calculate the lock order and it seems like your algorithm is actually safe. The only thing that worries me is that now the semaphore duplicates the functionality of m_jobs, and that the KisSimpleUpdateQueue::optimize() call is covered by neither the semaphore nor the m_jobs structure. It should not cause any problems right now, but it might be a problem in the future, because a semaphore is basically a lock, so it should be considered carefully.

The main blocker right now is that optimize() is called after the queue has actually been processed. That is not good because we may end up doing too many unnecessary updates.

So I can see two solutions:

  1. Fix your current patch to ensure optimize() is called before processQueue().
  2. Get back to your QWaitCondition solution, just simplify it by having no snapshots. QWaitCondition doesn't add any lock (it can be based on m_lock), so it doesn't add complexity to the system. At least I guess so :)

And if you have any questions, don't hesitate to ping me on IRC. I might be wrong as well :)

libs/image/kis_update_job_item.h
74 ↗(On Diff #3563)

The order of these calls is important. In sigDoSomeUsefulWork() the updates queue is optimized so as not to do the same work multiple times, and in
sigJobFinished() the queue is processed and the optimized tasks are put into the context. So this should at least be refactored somehow: a new signal can be added, or the contents of the slots can be regrouped.

asavonic updated this revision to Diff 5529. Jul 27 2016, 8:58 PM

Use QWaitCondition in waitForDone(), add asserts for lock()/unlock()

  • Add asserts when KisUpdaterContext methods used without lock()
  • Assume KisUpdaterContext::waitForDone() called with lock() held
  • Lock KisUpdaterContext where needed
asavonic updated this revision to Diff 5531. Jul 27 2016, 9:49 PM

Move all asserts under #ifndef QT_NO_DEBUG

Hi, @asavonic!

I'm sorry, I didn't notice you updated the patch! :)

The patch looks cool now, there are only three minor issues:

  1. waitForDone() is usually considered to work like lock()/unlock(), so people usually don't expect it to need any locking. Please move the locking inside waitForDone() itself :)
  2. Please rename QT_NO_DEBUG into something local, like SANITY_CHECK_CONTEXT_LOCKING. We usually don't control Qt's debugging definitions, and this sanity check should obviously be disabled in most cases :)
  3. Btw, it would be a good idea to change Q_ASSERT into KIS_ASSERT, because the former is always disabled on Ubuntu.

Please fix these small issues and I will push your patch! :)

libs/image/kis_update_scheduler.cpp
417

Good catch! :)

dkazakov requested changes to this revision. Aug 14 2016, 5:35 PM
dkazakov edited edge metadata.

Mark as needs changes

This revision now requires changes to proceed. Aug 14 2016, 5:35 PM
asavonic updated this revision to Diff 5954. Aug 15 2016, 9:17 PM
asavonic edited edge metadata.
  • Revert lock() requirement on waitForDone
  • Replace Q_ASSERT with KIS_ASSERT
  1. waitForDone() is usually considered to work like lock()/unlock(), so people usually don't expect it to need any locking. Please move the locking inside waitForDone() itself :)

Okay, but since the context lock is not recursive, it implies the opposite requirement: we must not call waitForDone() with the lock held.
But now we have an assert for this in lock() :)
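The guard described here can be sketched in standard C++ (CheckedLock and its members are hypothetical; Krita uses KIS_ASSERT and its own context lock): lock() remembers the owning thread, so re-locking from that same thread — for example calling waitForDone(), which now locks internally, while already holding the lock — fails an assertion instead of deadlocking silently.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Sketch of a non-recursive lock that detects re-locking from the owning
// thread, mirroring the assert added to KisUpdaterContext::lock().
class CheckedLock {
public:
    void lock() {
        // Re-locking a non-recursive mutex from its owner would deadlock;
        // fail loudly instead (KIS_ASSERT plays this role in Krita).
        assert(m_owner != std::this_thread::get_id() &&
               "lock() called while already held by this thread");
        m_mutex.lock();
        m_owner = std::this_thread::get_id();
    }

    void unlock() {
        m_owner = std::thread::id();   // reset to "no owner"
        m_mutex.unlock();
    }

private:
    std::mutex m_mutex;
    std::thread::id m_owner;   // inspected only for the sanity check
};
```

In a debug build the assert fires on the nested lock() call; in a release build the check compiles away and only the plain mutex behavior remains.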

  1. Please rename QT_NO_DEBUG into something local, like SANITY_CHECK_CONTEXT_LOCKING. We usually don't control Qt's debugging definitions, and this sanity check should obviously be disabled in most cases :)

Maybe this is debatable, but I think these checks make sense only if they are enabled by default (in debug builds only, of course).
Their only purpose is to warn about improper use of KisUpdaterContext functions (calling them without locks).
But if you need a special macro to do this check, and you know about its existence, then you are probably already aware of the requirements.

  1. Btw, it would be a good idea to change Q_ASSERT into KIS_ASSERT, because the former is always disabled on Ubuntu.

Good to know. Actually, Q_ASSERT works as expected on my Ubuntu machine (Qt 5.5.1), so maybe it depends on the Qt version.

Okay, but since the context lock is not recursive, it implies the opposite requirement: we must not call waitForDone() with the lock held.
But now we have an assert for this in lock() :)

Awesome!

  1. Please rename QT_NO_DEBUG into something local, like SANITY_CHECK_CONTEXT_LOCKING. We usually don't control Qt's debugging definitions, and this sanity check should obviously be disabled in most cases :)

Maybe this is debatable, but I think these checks make sense only if they are enabled by default (in debug builds only, of course).
Their only purpose is to warn about improper use of KisUpdaterContext functions (calling them without locks).
But if you need a special macro to do this check, and you know about its existence, then you are probably already aware of the requirements.

Well, the point is that most of the developers use RelWithDebInfo builds, or derivatives of it, like KritaDevs. So in 99% of the cases the check will be disabled and you will not be able to activate it without rebuilding the entire Krita. We use DebugFull builds very rarely because they tend to be extremely slow.

  1. Btw, it would be a good idea to change Q_ASSERT into KIS_ASSERT, because the former is always disabled on Ubuntu.

Good to know. Actually, Q_ASSERT works as expected on my Ubuntu machine (Qt 5.5.1), so it maybe depends on a Qt version.

I think it is related to the RelWithDebInfo thing I wrote above

libs/image/kis_update_scheduler.cpp
173

Yes, this reordering makes the code look much more sane! :)

dkazakov requested changes to this revision. Aug 16 2016, 6:50 AM
dkazakov edited edge metadata.

I have checked your patch. It works perfectly fine but, as I expected, QT_NO_DEBUG is not defined in the KritaDevs build mode. Of course, I can push the patch as it is, but it means that no one will run your sanity check before the release. I can propose the following solution to the problem:

  1. Define a separate definition, SANITY_CHECK_CONTEXT_LOCKING, in a straightforward way in a header:
#define SANITY_CHECK_CONTEXT_LOCKING
  2. People will test it before the release.
  3. Right before the release we will ifdef the definition so people will not be distracted by it:
#ifndef QT_NO_DEBUG
#define SANITY_CHECK_CONTEXT_LOCKING
#endif /* QT_NO_DEBUG */

How do you like it? If you are ok with the solution, please either update the patch or just write me and I will modify the patch right before pushing it.

This revision now requires changes to proceed. Aug 16 2016, 6:50 AM

How do you like it? If you are ok with the solution

Yes, I think it is the best option.

or just write me and I will modify the patch right before pushing it.

That would be great, thank you!

asavonic updated this revision to Diff 6000. Aug 17 2016, 5:42 PM
asavonic edited edge metadata.
  • Move debug checks under a SANITY_CHECK_CONTEXT_LOCKING macro
dkazakov accepted this revision. Aug 18 2016, 6:56 AM
dkazakov edited edge metadata.

The patch works and looks perfectly fine now! I'm going to push it! :)

This revision is now accepted and ready to land. Aug 18 2016, 6:56 AM
This revision was automatically updated to reflect the committed changes.