Improve Error Recovery Now That We Have Millions of Users
Open, Needs TriagePublic

Description

Currently, we build Krita for our users with HIDE_SAFE_ASSERTS=ON. To our users, hitting a KIS_*_ASSERT that means a crash. The assert message itself is added to the logfile.

Currently, if we would build Krita with HIDE_SAFE_ASSERTS=OFF, the users would see a scary message box warning them Krita is about to crash, with ignore and abort buttons. Abort promises more information, but doesn't give that information except on platforms with drkonqui running.

Definitions:

  • Q_ASSERT: a real assert, of the type that checks whether something is like the developer thought, aborts Krita in a developer build and is compiled out in a release build because the developer knows the problem isn't so bad the application halts
  • KIS_*_ASSERT: a family of "fake" asserts, that right now either shows a message box, shows a warning or throws an exception caught by KisApplication.

Now that we have millions of users, we want to:

  • Not scare our very untechnical users
  • Not make our users lose data unnecessarily:
    • no qFatal until the user has had a chance to save the file
    • or autosave all files on hitting an assert and then abort?
  • Not abort if it isn't really necessary.
  • Gather useful information for bug reports
  • Make development easier by having asserts abort Krita during development

For that we need to reimagine the way our error detection and recovery works.

  • Asserts need to be logged, with a full backtrace
  • We need to review all KIS*ASSERTS and Q_ASSERTS to see which ones should be converted to KIS_SAFE*ASSERTS.
    • Saving code should never be able to hit any assert
    • Asserts that claim to be recoverable should recover, as gracefully as possible, except in developer builds
    • Asserts that cannot recover should autosave and restart krita, except in developer builds
  • We need a system to distinguish between developer builds and release builds
    • developer builds should halt
    • release builds should show a message in the system log
  • We need a system to distinguish between alpha/beta builds and final release builds
    • Alpha/beta builds should have a message box that shows the assert and presents an easy way to report the bug, including the backtrace (kisBacktrace)
    • Release builds should act to save the user's data, then tell the user that Krita will restart and that any data can be recovered.

Related Notes

  • We assume that qCritical aborts. It doesn't, but we say so in the error messages.
  • KoImageCollection has an abort() call
  • bspline_create.cpp has many abort() calls
  • kxmlguiclient.cpp has an abort() call
  • We have 63 qFatal calls.
  • Q_ASSERT is only compiled in Debug mode, not RelWithDebInfo, which we use for our release builds.
  • KIS_*_ASSERT is always built, in Debug RelWithDebInfo and Release mode

Better Bug Reports

Asserts that show information to the user should use KBugReport. We want to extend KBugReport using the REST api of bugzilla (https://bugzilla.readthedocs.io/en/5.0/api/core/v1/bug.html#create-bug).

It would be very useful to add the system info file, the system log and the crash log (on Windows) as attachments or comments to bug reports created with kbugreport. For this, we need permission from sysadmin.

  • kisBacktrace is dependent on there being a function present called backtrace? On Linux, this is true with Release builds as well. On macOS this is true with RelWithDebInfo.

Also: only on Linux do we have drkonqui. On macOS we have nothing, and on Windows we have a kritacrash.log with (bad) backtraces and a krita.log which contains the last backtrace from kritacrash.log after ending with an error.

Saving The Data

kis_assert.cpp says:

/**
 * TODO: Add automatic saving of the documents
 *
 * Requirements:
 * 1) Should save all open KisDocument objects
 * 2) Should *not* overwrite original document since the saving
 *    process may fail.
 * 3) Should *not* overwrite any autosaved documents since the saving
 *    process may fail.
 * 4) Double-fault tolerance! Assert during emergency saving should not
 *    lead to an infinite loop.
 */

But actually, saving the document should be save; the last save is in the backup document, and copying the new version over the old is atomic, these days.

rempt created this task.May 15 2020, 1:34 PM
rempt updated the task description. (Show Details)
rempt updated the task description. (Show Details)May 15 2020, 1:37 PM
rempt updated the task description. (Show Details)May 15 2020, 1:42 PM
rempt updated the task description. (Show Details)May 15 2020, 1:45 PM
rempt updated the task description. (Show Details)May 15 2020, 1:47 PM
rempt updated the task description. (Show Details)May 15 2020, 1:52 PM
rempt updated the task description. (Show Details)May 15 2020, 1:58 PM
rempt updated the task description. (Show Details)
rempt updated the task description. (Show Details)May 15 2020, 2:22 PM
rempt updated the task description. (Show Details)May 15 2020, 2:25 PM

Original "Usage policy" for KIS_ASSERT

Original ideas under KIS_ASSERT:

  1. An "assert" is a problem in program's logic. In most of the cases it means that requirements of the code, defined by the code's author, are not satisfied anymore. In most of the cases, it means that the program has a bug.
  1. When the author adds a "requirement" to its code in a form of an assert, he/she can add some recovery path. That is, the program can continue, though the behavior might look buggy for the user.
  1. Theory of reliability provides us with definition of two kinds of errors: soft errors and hard errors.
    • hard errors are discovered at the moment they happened
    • soft errors are not discovered at all

See an Alexandrescu's video for an explanation of soft/hard errors: https://www.youtube.com/watch?v=nVzgkepAg5Y

Definition of assert types

  1. KIS_SAFE_ASSERT_RECOVER*() --- this failure means that some logical error (most probably a bug) has happened in Krita, but the user can safely continue working on his/her document.
  1. KIS_ASSERT_RECOVER*() --- this failure means that some logical error (most probably a bug) has happened in Krita, but the user must stop working on his/her document now, save it and restart Krita.
  1. KIS_ASSERT() --- it was expected to be used as a replacement for Q_ASSERT, when they are disabled. I'm not sure we should actually use it without any recovery path.

Discussion

Since we hide safe asserts for release builds, they may be considered as a soft error for the user. The user may continue working, even though Krita may run in a bit unexpected way. Though we guarantee that there will be no crash and no data loss.

And normal assert is a "hard error". They are never hidden and the user must know about them right at the moment they happen to avoid data-loss.

Proposed roadmap

  1. Review all the usages of KIS_ASSERT and try to downgrade them to a recoverable assert, either safe or unsafe.
  1. Try to downgrade some of the normal asserts to safe asserts. I'm not sure if it is possible, but it could help at least in some cases.
  1. Refactor UIX of the assert dialog:
    • add a textbox with the backtrace (therefore the screenshots provided by the users will give us more data)
    • hide "Abort" button under some "Advanced" tab. This button is still needed for developers/testers for generation of the proper backtrace under a debugger.
  1. Finally implement a fault-tolerant save-all-documents feature, that would save user's stuff in emergency case.

Hi, @rempt!

But actually, saving the document should be save; the last save is in the backup document, and copying the new version over the old is atomic, these days.

Even though saving itself is safe now, we should never rewrite anything during emergency saving. When saving in an "assert-state" the state of the image is not defined anymore, e.g. layers may have dangling pointers, pixel data may be corrupted and so on. Even if Krita manages to save the document without a crash, there is no guarantee that actual pixel data will still be valid.