Akonadi: fix dangling transaction after itemsync failure
ClosedPublic

Authored by dfaure on Mar 3 2019, 10:33 AM.

Details

Summary

TransactionSequence was emitting result() twice when rolling back.

  • How did this happen?

The TransactionRollbackJob is (automatically) added as a subjob of the
TransactionSequence, so when it finishes, slotResult is called (like for
all subjobs), as well as rollbackResult().
Since the latter emits result() already [mostly for symmetry with
commitResult()], we don't need to do that in slotResult (which doesn't do
it for the case of committing, either).

  • Why is it a problem to emit result() twice?

Well, first, it's against the law in KJob world. In practice,
ItemSyncPrivate::slotTransactionResult was called twice (for the same
TransactionSequence job) which made it decrement mTransactionJobs one
time too many.
As a result, checkDone() finished too early and didn't go into the
"commit transaction" branch for other transactions.
Leaving a transaction "open" is a good recipe for database deadlocks further
down the line.

  • Why did the TransactionSequence roll back in the first place?

In my case because of the infamous and not-yet fixed "Multiple merge
candidates" problem, but it seems that it can also happen when having
items without a part, according to Volker's investigations.
All of these issues still need to be fixed, but at least akonadi seems
to be still usable after they happen.

CCBUG: 399167

Test Plan

Ctrl+L in kmail, with a folder having multiple items for the same RID

Diff Detail

Repository
R165 Akonadi
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
dfaure created this revision.Mar 3 2019, 10:33 AM
Restricted Application added a project: KDE PIM. · View Herald TranscriptMar 3 2019, 10:33 AM
dfaure requested review of this revision.Mar 3 2019, 10:33 AM
  • Who wrote these faulty two lines of code?

Me, in commit f14a6e997629af96f83153eae33a0abd18366403. To fix the case that result was not emitted when the user clicks on the cancel button in kmail in the middle of the sync of a large folder. So this brings that bug back, damn.

In fact mTransactionJobs in ItemSync isn't really a counter. It's always either 1 (when we have a current transaction) or 0. So it's used as a bool, but in this bug it would go down to -1, and up to 0 when it should be 1. Any objections against making it a bool to make that code more robust? It's technically not the same as (mCurrentTransaction != nullptr) because mCurrentTransaction is set to nullptr immediately after calling commit() in checkDone(), before slotTransactionResult is called (I guess to avoid adding more subjobs to it while it's committing? Or is there no purpose in doing that, and therefore mTransactionJobs is useless even as a bool?).

Now back to debugging the case where the user cancels. I think this hits the "TODO cancel subjobs" comment in TransactionSequence::rollback()....

Nice find!

Considering the issues the job composition causes, would it be easier to not have the Transaction jobs to be a subjob of the ItemSync and handle the state propagation from the TransactionJob into ItemSync and vice versa manually instead? It's more code but feels like less unexpected magic behind the scene to me...

dfaure added a comment.Mar 3 2019, 8:21 PM

I'm not sure, not enough experience with this code for re-architecturing it, I only spent hours debugging one scenario :-)

To me KJob isn't unexpected magic but I agree that here the handling of the jobs seems to be partly in ItemSync and partly in ItemSync, which gets confusing.

But IMHO what would help even more with this code would be for it to be unittested....

dfaure added a comment.Mar 6 2019, 9:44 PM

Dan: so what's your take on this? OK as is, or requires refactoring?

dfaure added a comment.Mar 6 2019, 9:45 PM

(see also https://phabricator.kde.org/D19488 for the proper fix for canceled-by-user)

dvratil added a comment.EditedMar 6 2019, 10:28 PM

Do I understand it correctly that in combination with the other review, now that all subjobs are cleared before rollback, the emitResult() for the TransactionSequence will be called from Job::slotResult(job); because there are no more subjobs, so we no longer need to emit it ourselves?

kfunk added a subscriber: kfunk.Mar 12 2019, 12:51 PM

The result() signal from TransactionSequence is supposed to be emitted when the TransactionCommitJob or the TransactionRollbackJob finishes, see commitResult() and rollbackResult() further up in the file.

So this emitResult() was wrong and redundant -- see the beginning of this commit log.

In the case of rollback-from-user, we now (in D19488) kill all other subjobs, finish the running one, then run the rollbackjob, then emit result.

dvratil accepted this revision.Mar 21 2019, 3:46 PM
This revision is now accepted and ready to land.Mar 21 2019, 3:46 PM
This revision was automatically updated to reflect the committed changes.