akonadi > MySQL ERD Review
Open, NormalPublic

Description

This is an initial review of the Akonadi/MySQL Entity-Relationship Diagram (ERD).

Model Notation
Crow's Foot

Crow's Foot Refresher
The symbols at the end of each relationship define the minimum and maximum cardinality:

  • | - one
  • o = zero
  • < = many

For example given the following:

[parent] -------o< [child]
  • Each [parent] may have zero to many [child]'s
  • Each [child] may have one and only one [parent]

Observations

  • There are many standalone entities (e.g. [collectionpimitemrelation], [tagtypetable], [parttable], etc). I would think that somehow they would be related to some other entity.
  • Perhaps this is my own ignorance of Akonadi but at first blush, I see several islands of relationships. For example, here are couple:
    1. [resourcetable]/[tagremoteidresourcerelationtable]
    2. [collectiontable]/[pimitemtable]/[relationtable]/[pimitemtagrelation]/[mimetypetable]
  • The relationship [mimetypetable]/[pimitemtable] is defined such that we can have a [pimitemtable] row with no [mimetypetable] defined. I suspect that this is per specification but I want to be sure. :)
  • Minor - personally, I don't like to embed the type of object in the name of the object. For example, collectiontable. This is admittedly a nitpick. ;)
  • Minor - personally, it would be great to use underscores to separate names. I find it far easier to read [tag_remote_id_resource_relationship_table] than [tagremoteidresourcerelationshiptable] This is yet-another-nitpick.

Actionables

Outcomes of the current discussion in no particular order of importance or severity. Can be crossed-out once implemented/fixed.

  • Eventually change Associative Tables that have Unique Key to Primary Key (like the rest) (T7846#126304)
  • Drop unnecessary indexes from [CollectionTable] (T7846#126304)
  • Bool-like columns should be NOT NULL (T7846#126596)
  • Move [CollectionAttributeTable].type to it's own table to de-duplicate the strings, same for [TagAttributeTable] (T7846#126596, T7846#126615)
  • [PimItemTable].mimeTypeId, [PimItemTable].collectionId and [RelationTable].typeId FKs should be NOT NULL (T7846#126646)

Related Objects

StatusAssignedTask
OpenNone
Openpablos
pablos created this task.Jan 31 2018, 8:09 PM
pablos triaged this task as Normal priority.
pablos updated the task description. (Show Details)

There are some relations missing in the PDF, for example CollectionPimItemRelation - as the name suggests - holds n:m relations between CollectionTable and PimItemTable. I can see those relation in the mysql schema I sent you (i.e. CREATE TABLE CollectionPimItemRelation has the respective CONSTRAINTS). Did maybe something go wrong during your import?

Minor - personally, it would be great to use underscores to separate names. I find it far easier to read [tag_remote_id_resource_relationship_table] than [tagremoteidresourcerelationshiptable] This is yet-another-nitpick.

We use a single XML file to generate both the DB schema and the ORM C++ code, and since we use CamelCase names in C++, the generated SQL schema also uses the same system, but the capitals get removed when imported in SQL I suppose.

To my own surprise, I just discovered we generate a simple ERD as part of Akonadi documentation: https://api.kde.org/kdepim/akonadi/html/akonadi_server_database.html - you should see the missing relations there. The only lonely "island" is the SchemaTable, which is there simply to contain version of the current schema.

pablos added a comment.EditedJan 31 2018, 8:56 PM

Whaddya know! :) .. however the ERD in the documentation doesn't quite match the table names. I suppose it's like a Logical Model

For example, where we have Flag we have flagtable in the DDL. Also, all the camel case is lost on the table names. I do see it on the columns. A bug of sorts?

I'm going to regen the ERD because for me, I need to be able to see the physical implementation.

Edit

However, the very good thing is we should have relationships that clearly my ERD tool didn't pick up properly from the DDL. I'm going to point it to my Neon set up.

Thanks Dan!

pablos added a comment.EditedFeb 1 2018, 2:20 AM

A good way to easily share the MySQL local Instance.

Googling around, I found this nifty hack to map a TCP/IP port to the socket:

#
# Listen on port 12345 and pass it off to the local socket.
#
socat TCP-LISTEN:12345,fork UNIX-CLIENT:/var/tmp/akonadi-pablo.LmJY53/mysql.socket
pablos added a comment.Feb 2 2018, 1:38 AM

Round #2

I've spot-checked the ERD's and they appear to be correct based on the provided SQL. In fact, I pointed the ERD tool to my Neon VM and told it to reverse engineer the run-time akonadi database.

Observations

[partTypeTable]

  • The Alternate Key is on (name, ns). Is it necessary to use both columns?

Associative Tables do not have a Surrogate Key - slight anti-pattern

A logical many-to-many relationship is physically implemented by creating an Associative Table. For example, in the ERD PIMItemTagRelation is such a table as it resolves the many-to-many between [PIMItemTable] and [TagTable].

The anti-pattern is that most (not all) of these tables do not have a surrogate key (e.g. id defined as an identity data type) as the other tables.

Furthermore, If an associative table is a parent to another table, then it does gets a surrogate id. I understand the variance.

An additional issue with these tables, those without a surrogate id, is the primary key is composed of the foreign keys. Nearly all of these tables have two parents so we end up with an index like the following: (FK1, FK2). The problem is there is no supporting index should we Join only with FK. To solve that, the model defines an index on FK2.

The exception to the above is [RelationTable], it has three parents. As expected, the composite key is on its three FK's and two additional indexes, one on FK2 and one on FK3 is created.

What I recommend we (eventually) do is:

  • Ensure we can drop the existing, composite Primary Key
  • Add a surrogate id to each table and make it the Primary Key
  • Create separate indexes on each Foreign Key

By doing the above, the optimizer/planner can determine the most selective index to use when determining an execution plan for a query. Additionally, all the tables are consistent which means any functionally dependent code is also consistent.

[CollectionTable]

  • I would investigate further whether the index on (name, parentId) is necessary. It is possible it's necessary if we have an extremely high fan-out from the parent to the next level of children (e.g. 20, 50, 100 next-level children)?
  • Generally indexes on low cardinality columns (e.g. gender) are not helpful. Unless of course one is fetching only by that column and perhaps a small percentage of the data qualifies by the value. With this in mind, I would investigate if we need indexes on enabled, syncPref, displayPref and indexPref. I suspect we should be able to drop them. I'm curious to know the SQL which uses these columns. I suspect other columns are used which would have existing supporting indexes.

[partTypeTable]

The Alternate Key is on (name, ns). Is it necessary to use both columns?

Yes, type names have two parts: a namespace and a name and as such the combination of those two has to be unique, which is what the key enforces. Namespace determines the type of the part (generally it's either "PLD" for payloads or "ATR" for attributes) and name is the actual name of the type. The reason for the split is that sometimes you want to query for example all attributes belonging to an Item, in which case you can do a query like

SELECT * FROM PartTable LEFT JOIN PartTypeTable ON PartTypeTable.id = PartTable.partTypeId WHERE PartTypeTable.ns = 'ATR' AND PartTable.pimItemId = ?;

Associative Tables do not have a Surrogate Key - slight anti-pattern

What would be the benefit of having a Surrogate key for those relational tables? The reason for the composite PK is to enforce uniqueness of the value pairs in those tables, and from our point of view, the value pair is the unique identifier of the relation. There is no need to be able to refer to such relations by a single identifier.

What I recommend we (eventually) do is:

  • Ensure we can drop the existing, composite Primary Key
  • Add a surrogate id to each table and make it the Primary Key
  • Create separate indexes on each Foreign Key

How about we only create separate indexes on each Foreign Key to make work easier for the planner, but we keep the composite PK (or turn it into a regular UNIQUE INDEX)? The alternative is introducing the surrogate id as a PK and adding a UNIQUE index on the Foreign Key pairs to enforce uniqueness but that seems like a lot of pointless overhead as the surrogate id would never be used and we would be introducing another index on potentially large tables.

[CollectionTable]
I would investigate further whether the index on (name, parentId) is necessary. It is possible it's necessary if we have an extremely high fan-out from the parent to the next level of children (e.g. 20, 50, 100 next-level children)?

The index is UNIQUE, thus enforcing that we don't end up with multiple Collections with the same name and the same parent (same as you can't have multiple subfolders with the same name in one folder).

Generally indexes on low cardinality columns (e.g. gender) are not helpful. Unless of course one is fetching only by that column and perhaps a small percentage of the data qualifies by the value. With this in mind, I would investigate if we need indexes on enabled, syncPref, displayPref and indexPref. I suspect we should be able to drop them. I'm curious to know the SQL which uses these columns. I suspect other columns are used which would have existing supporting indexes.

Usually, when client requests a Collection, those columns are used to filter the results by:

SELECT * FROM CollectionTable .... WHERE (CollectionTable.enabled = 1 OR CollectionTable.referenced = 1) AND ....

Now that I think about it, the variance in those columns is rather low (all the *Pref columns usually have value of 2) in which case I guess the index is indeed not very useful. For the majority of users, even the "enabled" column is true everywhere.

I have attached the latest ERD's. These are printer-friendly (no color). I had to jig the ERD tool's XML file for MySQL.

Hey Dan,

Sorry for the delay.

Overall, if we were to equate the ERD to syntax versus semantics, syntactically it's in relatively good shape - more below.

Thank you for the explanation on [PartTypeTable].

On the Associative Tables, thank you for your response. Let's leave them as they are defined. We have some inconsistencies but the net effect is the same: Primary Key versus Unique Key. From my experience, the optimizer should treat a Primary Key like a Unique Key

At some point we should change the following tables to define a Primary Key rather than a Unique Key:

  • RelationTable
  • TagRemoteIdResourceRelationTable

On [CollectionTable], the (extremely) general guideline is if we are returning more than 10% of the data, it's more efficient to perform an index scan. The actual percentage varies though. Some tricks can be applied so that if predominantly a subset of columns are always returned, an index can be used to create a mini-table The Optimizer/Planner is generally smart enough to perform a full index scan of this structure.

This method can be a bit fragile because if dependent queries add more columns, they fall off of the index and instead need to perform their full scan against the table. I generally don't like to use this piggyback hack. It's a bit brittle.

Recap

  • Eventually change Associative Tables that have Unique Key to Primary Key (like the rest)
  • Drop unnecessary indexes from [CollectionTable]

Thoughts?

pablos added a comment.EditedFeb 5 2018, 3:46 PM

Flags should be NOT NULL

I noticed [ResourceTable] has a flag named isVirtual and it is NULL'able. It is defined as a TINYINT(1) (effectively a Boolean). We'll want to fix this column so it is NOT NULL The DDL for the column has a default (zero).

Similar for Preference columns - e.g. [CollectionTable].indexPref, etc

By making the columns NOT NULL it simplifies all code. We don't have to check whether a column is set to NULL or not and/or use NULLIF() functions.

We only want to do this for columns that truly are NOT NULL candidates.

Key/Value Pair anti-pattern

The [CollectionAttributeTable] is being used as a generic key/value pair table - see why this is not a good way to go if there is heavy demand on this table.

.. and [TagAttributeTable]

dvratil updated the task description. (Show Details)Feb 5 2018, 5:12 PM
knauss added a subscriber: knauss.Feb 5 2018, 5:17 PM

The [CollectionAttributeTable] is being used as a generic key/value pair table - see why this is not a good way to go if there is heavy demand on this table.

This table is used, so that each resource can save additionla data for a collection, so far I know it is mostly requested via collectionid and a given type. Aka:

select value from CollectionAttributeTable where type="type" and collectionId=?

One type is defined in one attributeclass see. src/core/*attribute.(cpp|h) and than src/core/protocolhelper.cpp and src/core/attributefactory.cpp doing the magic.

Flags should be NOT NULL

Fair point, I added it to actionables, that should be fairly easy to fix. Just needs modifying the XML and Akonadi should be able to self-correct the schema

Key/Value Pair anti-pattern

The link lists the problems but does not offer any good advice, considering our storage is generic (we don't know the keys beforehand). What would be your advice? We can normalize the [CollectionAttributeTable].type column by moving the key names to their own table like we do in [PartTable] as suggested in the link, but that's about all that I could find there that would apply in our case.

This table is used, so that each resource can save additionla data for a collection, so far I know it is mostly requested via collectionid and a given type. Aka:
select value from CollectionAttributeTable where type="type" and collectionId=?

The most common query is actually

SELECT collectionId, type, value FROM CollectionAttributeTable WHERE type IN (....) AND collectionId IN (....) ORDER BY collectionId ASC

One type is defined in one attributeclass see. src/core/*attribute.(cpp|h) and than src/core/protocolhelper.cpp and src/core/attributefactory.cpp doing the magic.

Where the types are defined in code is irrelevant, we are looking into how to deal with this on the DB level. The implementation should not stand in the way of db optimization (otherwise we wouldn't get anywhere :-))

pablos added a comment.EditedFeb 5 2018, 5:46 PM

We can normalize the [CollectionAttributeTable].type column by moving the key names to their own table like we do in [PartTable] as suggested in the link, but that's about all that I could find there that would apply in our case.

The above suggestion is what I would first suggest.

For this discussion, let us call this new table [CollectionAttributeTableType] (boy am I clever!) The next question is whether this new table's domain is shareable by all of PIM (a global pool) or does each [ResourceTable]] get its own set of values. I don't know, I'm just asking ... :) .. and yes, I'm slowly winding my way through the per-table doc today (which is why I can even offer the table name [ResourceTable]!)

pablo@ on akonadi
12:37:11 27> select distinct type from collectionattributetable;
+----------------------------+
| type                       |
+----------------------------+
| AccessRights               |
| ENTITYDISPLAY              |
| SpecialCollectionAttribute |
+----------------------------+
3 rows in set (0.01 sec)

pablo@ on akonadi
12:37:48 28> select type, count(*) from collectionattributetable group by type;
+----------------------------+----------+
| type                       | count(*) |
+----------------------------+----------+
| AccessRights               |        5 |
| ENTITYDISPLAY              |        9 |
| SpecialCollectionAttribute |        7 |
+----------------------------+----------+
3 rows in set (0.01 sec)
dvratil added a comment.EditedFeb 5 2018, 6:41 PM

I would call it [CollectionAttributeTypeTable] to keep consistency with the other tables :)

I'm not sure what you mean by "does each [ResourceTable] get its own" - there's only one [ResourceTable] - if you are asking if each Resource in the [ResourceTable] gets its own "domain" then no, all Resources share the same table space.

[Edit: added an actionable for this task]

Btw:

akonadi=# select encode(type::bytea, 'escape') as type, count(*) from collectionattributetable group by type;
            type            | count
----------------------------+-------
 noselect                   |   196
 favorite                   |     3
 ENTITYDISPLAY              |   103
 uidvalidity                |   194
 highestmodseq              |   186
 KAlarmCollection           |     3
 collectionquota            |   194
 ctag                       |    10
 SpecialCollectionAttribute |     8
 uidnext                    |   194
 KAlarmCompatibility        |     3
 imapquota                  |   194
 PERSISTENTSEARCH           |     2
 AccessRights               |   860
 BlockAlarmsAttribute       |    95
 imapacl                    |   188
 davprotocol                |    10
 collectionidentification   |   818
 collectionflags            |   194
 newmailnotifierattribute   |   690
 collectioncolor            |    32
 collectionannotations      |   747
(22 rows)

akonadi=# select count(*) from collectionattributetable;
 count 
-------
  4924

Notice how consistent we are in naming the attributes ;-)

dvratil updated the task description. (Show Details)Feb 5 2018, 6:45 PM
pablos added a comment.Feb 5 2018, 6:51 PM

I'm not sure what you mean by "does each [ResourceTable] get its own" - there's only one [ResourceTable] - if you are asking if each Resource in the [ResourceTable] gets it's own "domain" then no, all Resources share the same table space.

I'm sorry for not being clear. What I was asking is whether say Korganizer would want its own set of values in [CollectionAttributeTypeTable] versus Kmail. Or would all Akonadi Client's share the values in this table?

Most often the attribute types are internal to the Resource that created the Collection and only the Resource understand what the value means (e.g. all of uidvalidity, highestmodseq, collectionquota, uidnext, imapquota and imapacl attributes are only understood by the IMAP resource which creates them). Some attributes are known to clients too (e.g. ENTITYIDISPLAY which describes the Collection display name and icon) and it is desirable that those are shared between all clients. If e.g. KOrganizer wants to store some extra data that are not relevant to any other client, it can create a new attribute type and put the data in there.

pablos added a comment.Feb 5 2018, 7:00 PM

Ah, so do we want to do this as it gives each [ResourceTable] row the option to create its own [CollectionAttributeTypeTable] rows.

Perhaps it's not needed but I thought I'd offer it up.

[ResourceTable] -> [CollectionAttributeTypeTable] -> ...

Not needed. This would not really work anyway because some well-defined attribute types can be used by multiple Resources or even by clients and we don't keep track of clients in the DB.

pablos added a comment.Feb 5 2018, 8:24 PM

Optional Parent FK's

The tables listed below have NULL'able Foreign Keys. Is this intended?

Generally I like to make the DBMS enforce as much Referential Integrity as possible.

Sometimes it doesn't make sense to make FK's NOT NULL so ... that's why I'm asking. ;)

Note: recursive relationships must have a NULL'able FK so we're good on that front.

  • [MimeTypeTable] -> [PIMItemTable]
  • [CollectionTable] -> [PIMItemTable]
  • [RelationTypeTable] > [RelationTable]
dvratil updated the task description. (Show Details)Feb 5 2018, 9:02 PM

Indeed all of the cases should be NOT NULL. Actionable'd.

dvratil updated the task description. (Show Details)Feb 5 2018, 9:05 PM