Krita database migration system
Open, Needs Triage, Public

Description

Main idea (suggested by Dmitry, based on Ruby on Rails, I believe? And I (Tiar) think it's neat):

  • one query to create a database
  • a query to change the schema from 0.0.1 to 0.0.2
  • a query to change the schema from 0.0.2 to 0.0.3
  • etc.
  • possibly flattening 0.0.1-0.n.m when we stop supporting anything below 0.n.m
  • when the current version of Krita's database is 0.0.5 and there is no database file, Krita just applies all queries in order: first the base query, then 0.0.1-0.0.2, then 0.0.2-0.0.3, etc., until it gets to 0.0.5
  • when the current version of Krita's database is 0.0.5 and there is a database file of version 0.0.3, Krita just applies the 0.0.3-0.0.4 query and then the 0.0.4-0.0.5 query (a runner for this is sketched just below the list)
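A minimal sketch of how such a runner could look, assuming Qt's SQL module (which Krita already uses) and SQLite; the table names, version strings, and the version_information bookkeeping table are illustrative assumptions, not Krita's actual schema:

```cpp
// A sequential migration runner, assuming Qt's SQL module and SQLite.
// Table names, version strings and the version_information table are
// illustrative, not Krita's actual schema.
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QStringList>
#include <QVector>

struct MigrationStep {
    QString from;           // version this step upgrades from ("" = no database yet)
    QString to;             // version this step produces
    QStringList statements; // SQL applied for this step
};

// Ordered list: the first entry is the base schema, every later entry is one delta.
static const QVector<MigrationStep> steps = {
    {"", "0.0.1", {"CREATE TABLE version_information (version TEXT);",
                   "INSERT INTO version_information VALUES ('0.0.1');",
                   "CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT);"}},
    {"0.0.1", "0.0.2", {"ALTER TABLE resources ADD COLUMN md5sum TEXT;"}},
    {"0.0.2", "0.0.3", {"CREATE INDEX idx_resources_name ON resources (name);"}},
};

// Applies, in order, every step that comes after `current`;
// pass "" when there is no database file yet.
bool migrateFrom(QSqlDatabase &db, const QString &current)
{
    bool apply = current.isEmpty();
    for (const MigrationStep &step : steps) {
        if (!apply) {
            apply = (step.from == current); // skip steps already applied
            if (!apply) continue;
        }
        for (const QString &sql : step.statements) {
            QSqlQuery q(db);
            if (!q.exec(sql)) return false; // abort, let the caller report the error
        }
        // Persist the reached version so an interrupted run is detectable.
        QSqlQuery v(db);
        v.prepare("UPDATE version_information SET version = :v");
        v.bindValue(":v", step.to);
        if (!v.exec()) return false;
    }
    return true;
}
```

The important property is that the list is append-only: released steps are never edited, new ones are only added at the end, until old steps get flattened away as suggested above.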

Questions:

  • what is Halla's opinion about it?
  • what are alternative ways of making the database migration system more clean/clear (right now it's a bit messy, since it's all manual in the code)?
  • what about users who use Krita 5.1.0, then go back to 5.0.2, then to 5.0.0-beta1, and then to 5.1.0 again? (As in: what if the "current" database version is smaller than the version of the existing file, because the current Krita version is older than the one that created or updated the database?) While it's fine not to support this fully, it would be good to let users run nightly builds, so going at least two or three versions back should be possible without bad repercussions, if we can achieve that.
tymond created this task. Feb 15 2022, 8:26 PM
tymond claimed this task.
tymond added a project: Krita.
tymond updated the task description.
rempt added a subscriber: rempt. Edited Feb 16 2022, 12:56 PM

My take:

  • You cannot create the database in one query; you need separate statements for tables and indexes -- though maybe that isn't what you meant by "one query to create a database"?
  • You can't always alter the database using a query. Sometimes you will have to drop a table and recreate it, for instance when column names change.
  • Piling on alter database statements to come to the current version will be slow and fragile: we should not do that.
  • We shouldn't write fragile code to support going back a version.
  • We should strive for a very stable database schema.
  • Our database creation code should create the latest version of the database.

Having possible steps between versions of the database that update an existing database to the latest version is fine, but should be decided on a case-by-case basis. My experience with maintaining a database schema dates back to the nineties, but this is how we did it back then, for our customers...

> Having possible steps between versions of the database that update an existing database to the latest version is fine, but should be decided on a case-by-case basis. My experience with maintaining a database schema dates back to the nineties, but this is how we did it back then, for our customers...

Well, things have changed significantly since then. The iterative approach works perfectly fine for Ruby on Rails and projects like that. Given that no code in the app other than the migrations modifies the database schema, it should be safe.

> You cannot create the database in one query; you need separate statements for tables and indexes
> You can't always alter the database using a query. Sometimes you will have to drop a table and recreate it, for instance when column names change.

I agree here. We cannot modify the schema with SQL statements only; we need some "migration entity" in C++ (a class?). A good example is the addition of translations to the database: we cannot do that with SQL alone, we need some C++ code for it.
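One possible shape for such a migration entity, as a sketch only; the class and method names are hypothetical, and the translations example just illustrates a step that needs C++ logic beyond plain SQL:

```cpp
// Sketch: each migration step is a small class that can run arbitrary
// C++ in addition to SQL. Names here are hypothetical, not Krita code.
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>

class KisDatabaseMigration // hypothetical name
{
public:
    virtual ~KisDatabaseMigration() = default;
    virtual QString fromVersion() const = 0;
    virtual QString toVersion() const = 0;
    virtual bool apply(QSqlDatabase &db) = 0;
};

// Example of a step that needs C++ logic: the translated strings come
// from the application's i18n catalogs at runtime, not from SQL.
class AddTranslationsMigration : public KisDatabaseMigration
{
public:
    QString fromVersion() const override { return "0.0.3"; }
    QString toVersion() const override { return "0.0.4"; }

    bool apply(QSqlDatabase &db) override
    {
        QSqlQuery ddl(db);
        if (!ddl.exec("CREATE TABLE translations (resource_id INTEGER, "
                      "language TEXT, name TEXT)")) {
            return false;
        }
        // ...then walk the existing resources and insert a translated
        // name for each one, using whatever i18n machinery Krita provides...
        return true;
    }
};
```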

> Piling on alter database statements to come to the current version will be slow and fragile: we should not do that.

I don't agree with the "slow" point. If we drop and recreate the database, the user will potentially lose some of their metadata, and recovering it will take much more time than spending a dozen seconds on performing migrations.

The point about fragility is valid, but to make it non-fragile we should just add proper testing for it. Right now we migrate in an ad-hoc way without any unit testing at all; that is what is really fragile.
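For illustration, such testing could look roughly like this, using QtTest with in-memory SQLite databases; migrateUpTo() is a hypothetical variant of the runner sketched earlier that stops once the given version is reached, and migrateFrom() continues from there to the latest version:

```cpp
// Sketch of a migration test: every supported starting version must
// upgrade cleanly to the latest schema. Helper functions and the test
// file name are hypothetical.
#include <QtTest>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QStringList>

bool migrateUpTo(QSqlDatabase &db, const QString &version); // hypothetical, see sketches above
bool migrateFrom(QSqlDatabase &db, const QString &current);

class TestMigrations : public QObject
{
    Q_OBJECT
private Q_SLOTS:
    void upgradesFromEverySupportedVersion()
    {
        const QStringList starts = {"0.0.1", "0.0.2", "0.0.3"};
        for (const QString &start : starts) {
            const QString connection = "migration-test-" + start;
            {
                QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE", connection);
                db.setDatabaseName(":memory:");
                QVERIFY(db.open());

                QVERIFY(migrateUpTo(db, start)); // recreate a database of that era
                QVERIFY(migrateFrom(db, start)); // then upgrade it to the latest

                // Sanity check: the upgraded schema answers a current query.
                QSqlQuery q(db);
                QVERIFY(q.exec("SELECT version FROM version_information"));
            }
            QSqlDatabase::removeDatabase(connection);
        }
    }
};

QTEST_GUILESS_MAIN(TestMigrations)
#include "testmigrations.moc" // hypothetical file name for moc
```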

> We shouldn't write fragile code to support going back a version

About going back a version, I guess we should consider that if and only if we find a library that provides this functionality. It is supported by most libraries from the web world, though I'm not sure such libraries exist for C++. If we don't find such a library, then we can/should skip writing that ourselves.

> Our database creation code should create the latest version of the database

That means we'll have code duplication, which sounds like a weird idea. Though it may be a solution if we add a unit test that guarantees that the iteration-based and bulk-mode creation strategies create exactly the same database.
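Such a test could, for example, compare the schema SQL that SQLite itself stores in sqlite_master after running both creation strategies. A sketch, where createLatestSchema() is a hypothetical bulk-creation function and migrateFrom() is the runner sketched earlier:

```cpp
// Collects the schema SQL that SQLite stores in sqlite_master, so two
// databases can be compared for an identical schema.
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QStringList>
#include <QVariant>

QStringList schemaOf(QSqlDatabase &db)
{
    QStringList schema;
    QSqlQuery q(db);
    q.exec("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY sql");
    while (q.next()) {
        schema << q.value(0).toString();
    }
    return schema;
}

// In the test: open two in-memory databases, run createLatestSchema()
// on one and migrateFrom(other, "") on the other, then assert:
//     QCOMPARE(schemaOf(bulkDb), schemaOf(iterativeDb));
```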

>> We shouldn't write fragile code to support going back a version
>
> About going back a version, I guess we should consider that if and only if we find a library that provides this functionality. It is supported by most libraries from the web world, though I'm not sure such libraries exist for C++. If we don't find such a library, then we can/should skip writing that ourselves.

I'm not sure why there would be a need for that?

I'm mostly worried about making life miserable for our beta testers. That would mean only going back one version (or as many versions as we put between the latest stable release and the beta version; maaaybe latest stable or latest nightly). Recently we have only been adding new columns and new things. The problem is that removing a whole column means removing data, and we can't claim the version is 0.0.1 if it has columns that belong in 0.0.2, because then the migration from 0.0.1 to 0.0.2 wouldn't work (I say columns, but it can be other things). So I'm not sure. Thoughts?

> I'm mostly worried about making life miserable for our beta testers. That would mean only going back one version

Well, if you support moving back one version, you automatically support moving back as many versions as you like. That is, when adding a new database version, you need to add something like an undo command, which rolls the database forward and backward. And to limit the rollback to one version only, you would just need to remove the undo branches from all the commands for previous versions, which is a bit of a strange pipeline.
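In code, the undo-command idea could mean that every step carries both directions, so rolling back N versions is just running the "down" lists in reverse order. A sketch with hypothetical names and SQL:

```cpp
// Sketch: a migration step that knows both directions. Names and SQL
// are hypothetical, not Krita's schema.
#include <QString>
#include <QStringList>
#include <QVector>

struct ReversibleMigration {
    QString from;
    QString to;
    QStringList up;    // applied when upgrading from -> to
    QStringList down;  // applied when downgrading to -> from
};

static const QVector<ReversibleMigration> reversibleSteps = {
    {"0.0.1", "0.0.2",
     {"ALTER TABLE resources ADD COLUMN md5sum TEXT;"},
     // DROP COLUMN needs SQLite 3.35+; older versions need the
     // drop-and-recreate dance rempt mentions above.
     {"ALTER TABLE resources DROP COLUMN md5sum;"}},
};
```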

Another question is: what happens if we update the database twice during the same beta-testing phase? It would still make the life of testers miserable :)

Btw, here is the explanation of how it works in Ruby on Rails: https://guides.rubyonrails.org/active_record_migrations.html

Basically, they have an extremely high-level language for defining migrations, which allows auto-generation of undo steps. You just write "add a column with these fields" and it automatically generates the redo and undo steps.

I don't think we have enough motivation to write such an extensive system ourselves, but if we found a library that allowed us to do that, we could use it to our benefit. The only price for us would be writing proper tests for these migrations (which we need to write anyway).

Google says that the most popular Python-based migrations library (after Django's) is Alembic. It also has an undo-command-like API, though it doesn't have auto-generation of the undo steps.

https://alembic.sqlalchemy.org/en/latest/tutorial.html#running-our-first-migration

dkazakov added a comment. Edited Feb 23 2022, 7:20 AM

And it seems there is a golang-based tool that allows auto-generation of SQL files for migrations (redo + undo steps). We could actually use that: generate the undo/redo SQL files with the tool, and add them into a C++ wrapper that fills in the data in the new columns if needed.

https://github.com/golang-migrate/migrate/blob/master/database/postgres/TUTORIAL.md
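For reference, that tool keeps each migration as a pair of plain SQL files (e.g. 2_add_md5sum.up.sql and 2_add_md5sum.down.sql in its tutorial). A C++ wrapper could ship such files as Qt resources and execute them; a rough sketch, with hypothetical file names and a deliberately naive statement splitter:

```cpp
// Sketch: execute the statements of one generated SQL file.
#include <QFile>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QStringList>
#include <QTextStream>

bool runSqlFile(QSqlDatabase &db, const QString &path)
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) return false;
    QTextStream in(&file);
    // Naive split on ';' -- enough for plain DDL files, not for
    // statements that embed semicolons in strings or triggers.
    const QStringList statements = in.readAll().split(';', Qt::SkipEmptyParts);
    for (const QString &statement : statements) {
        if (statement.trimmed().isEmpty()) continue;
        QSqlQuery q(db);
        if (!q.exec(statement)) return false;
    }
    return true;
}

// Upgrade:  runSqlFile(db, ":/migrations/2_add_md5sum.up.sql");
// Rollback: runSqlFile(db, ":/migrations/2_add_md5sum.down.sql");
```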

It's not only about the database schema, though. For example, one of the recent changes in the schema was adding the "active" column to the resources_tags table. So if someone uses the current nightly to tag and untag resources, and then goes back to before that change, the column would disappear, right? Then every tag-resource pair would denote that the resource is tagged, even though the user untagged it. And there can be other, currently unforeseen issues in the future.
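One possible mitigation, sketched here only as an illustration and not an agreed design: the downgrade step for that change could delete the inactive rows before dropping the column, so that the pairs that remain mean "tagged", which is what the pre-"active" schema assumed. The untag history is still lost, but an older Krita would at least see a consistent view:

```cpp
// Sketch of a semantics-preserving downgrade for the "active" column;
// the function name is hypothetical.
#include <QSqlDatabase>
#include <QSqlQuery>

bool downgradeActiveColumn(QSqlDatabase &db)
{
    // Remove the untagged pairs first, so the remaining rows match the
    // old schema's meaning of "this resource is tagged".
    QSqlQuery cleanup(db);
    if (!cleanup.exec("DELETE FROM resources_tags WHERE active = 0")) return false;

    QSqlQuery drop(db);
    // DROP COLUMN needs SQLite 3.35+; older versions need the
    // drop-and-recreate dance discussed above.
    return drop.exec("ALTER TABLE resources_tags DROP COLUMN active");
}
```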