[Akonadi] ItemFetchJob size limit
Open, Needs Triage · Public

Description

We want to be able to retrieve only a limited number of Items from a Collection with the ItemFetchJob. This would likely involve extending the ItemFetchJob API to allow callers to specify how many Items they want to receive, extending the Akonadi protocol (by modifying the protocol.xml file) so that the limit can be sent to the server as part of the command, and then adjusting the SQL queries in FetchHelper in the Akonadi server to actually honor the limit.

The last part may be a little tricky because of the way the queries are created, but in the worst case we can query more than we need and simply throw part of the results away.

dkurz added a subscriber: dkurz. Oct 31 2018, 5:33 PM
kondzio assigned this task to dvratil. Sep 21 2020, 3:11 PM
kondzio added a subscriber: kondzio.

Hello Daniel,
I would like to try to implement this task. Can this task be implemented separately, or is it somehow related to task T645?
Do you have any information that would make it easier for me to start coding? Currently I have little knowledge about Akonadi.

Thanks,
Konrad.

Now I understand, more or less, how this should be implemented. Daniel, if you just answer my first question, I think I can start working on it.

Hi Konrad! This may be a prerequisite for T645, but it is not directly related to it.


Design questions

First, we need to figure out HOW we want to limit the retrieved Items: do we want to get the N newest Items, the N oldest Items, the N Items with the highest ID, or the N Items with the lowest ID? Do we also want to be able to specify a "lower" limit, to be able to fetch a particular range of Items (e.g. 1-100, 101-200, ...)?

I think it makes sense for the ItemFetchJob API to allow the following: sortOrder, limit, start. I think it could even be a single method; I'd propose something like

void ItemFetchJob::setLimit(int limit, int start = 0, Qt::SortOrder order = Qt::DescendingOrder)

By default this would retrieve the limit newest Items, with the possibility of further customization.


Implementation on client side

I think the first place to start is to extend the protocol definition (in src/private/protocol.xml) and adjust the ItemFetchJob API to allow specifying the limit and filling it into the protocol, so that the information is sent to the server.

Implementation on server side

To better understand the server-side code and the terminology used there, I'd recommend you read this Architecture overview first.

On the server side, this will be more tricky: look into src/server/handler/itemfetchhelper.cpp to understand how the code that retrieves the Items from the database actually works. You will see that in order to obtain an entire Item, we need to receive several pieces of data for each Item: parts, flags, tags, virtual references (virtual collections the Item may be linked to, vrefs) and the Item itself.

This is done through 5 independent SQL queries: in each query we query the PimItemTable (filtered based on the Fetch command "scope") and LEFT JOIN the respective table to get the list of parts, flags, tags and vrefs for each PimItem. The results are always sorted in descending(*) order by PimItem.id. After all five queries are executed, we use a while loop to iterate over the results of the PimItem table query, and in each iteration we also iterate through the results from the additional queries to get the additional data to assemble the entire Item.

Here's the catch: the additional queries are always 1:N relations - in other words, a single PimItem can have zero to many parts, flags, tags and vrefs. So if the ItemFetchJob declares it only wants 100 Items, you can use SQL LIMIT 100 to limit the number of PimItems you receive, but you can't use LIMIT 100 to limit the number of flags you receive, because there can be multiple flags per Item, so you would be missing some. This is the tricky part.

So right now we do the following query to get flags:

SELECT ... FROM PimItemTable LEFT JOIN PimItemFlagRelation ON (PimItemTable.id = PimItemFlagRelation.pimitem_id) WHERE ... ORDER BY PimItemTable.id DESC

I think the only feasible solution right now is to change that to a subquery:

SELECT ... FROM (SELECT * FROM PimItemTable WHERE ... ORDER BY id ${direction} LIMIT ${start}, ${limit}) AS pimItems LEFT JOIN PimItemFlagRelation ON (pimItems.id = PimItemFlagRelation.pimitem_id) ORDER BY pimItems.id ${direction}

However, we could optimize this and create a temporary view with the filtered, limited results on the PimItemTable first, then use the view in the additional queries, then destroy the view again. This might actually improve the performance a fair bit. Actually, now that I think about it, this could be a completely separate task that I would probably do as the very first thing (without any support for limiting the results) on the existing code, to check whether it's a feasible approach and whether it has a notable impact on performance, and implement that (and possibly even merge it to master independently of the rest of this task). Then I'd go back and extend it with support for limiting the results etc.


(*) Why descending? This is an optimization to display the newest Items first - e.g. in KMail you want to see the newest email ASAP and don't care how long it takes for the email from 10 years ago to display...

kondzio added a comment. Edited Oct 28 2020, 1:25 AM

Hello Daniel,
Could you please advise me?
The main problem for me with implementing these changes is that every function in ItemFetchHelper that builds its own query, for example buildPartQuery, finally calls ItemQueryHelper::scopeToQuery(mScope, mContext, partQuery).
All the helper functions in ItemQueryHelper that generate additional WHERE clauses - scopeToQuery, gidToQuery, remoteIdToQuery, itemSetToQuery - have hardcoded table and column names for PimItem and are also used in many other places, not only in ItemFetchHelper.

Therefore, I think it would be necessary to somehow pass to all these helper functions the name of the table for PimItem that the QueryBuilder class will refer to,
because, if I understand correctly, for calls from ItemFetchHelper we would like to use the view, while all other places should keep using the main table "PimItemTable".

We could also add a public function in QueryBuilder that would allow reading the mTable field, and match the names of tables and columns based on that.
I have a dilemma about which solution would be the most appropriate - or maybe there is some other way that I missed?

Thanks,
Konrad.

dvratil added a comment. Edited Oct 29 2020, 2:10 PM

Hi,

that's a good point. I would say that adding a getter to QueryBuilder that returns the mTable field, and using that in the query helpers, is a good design decision.

Hello Daniel,
I encountered another problem while working on this task.
How should we create a temporary view? From what I know, MySQL doesn't support temporary views.
Initially, I tried to create a view based on the query returned by the buildItemQuery function, only when the ItemFetchJob had a limit greater than zero,
and when it was no longer needed I removed it at the end of the ItemFetchHelper::fetchItems function.
Next, I prepended an id to the view name, because each connection/thread on the server may have a different limit at the same time, and I also added mutex locking around the creation of each new view.
In this approach I see one potential risk: if an exception occurs (e.g. the client closing the connection) between creating and deleting the view, the created view will never be deleted.
Can you advise me how to deal with this? Or maybe there is an easier way to do it?

Thanks,
Konrad.

knauss added a subscriber: knauss. Feb 7 2021, 12:21 PM

But why create temporary views? No human is writing these queries by hand. Views do not improve the speed in any case: if you only construct a view for one request, it only makes the resulting query nicer to read, with no speed improvement. I think the first idea of @dvratil for the query is fine:

SELECT ... FROM (SELECT * FROM PimItemTable WHERE ... ORDER BY id ${direction} LIMIT ${start}, ${limit}) AS pimItems LEFT JOIN PimItemFlagRelation ON (pimItems.id = PimItemFlagRelation.pimitem_id) ORDER BY pimItems.id ${direction}

I thought Daniel recommended creating a temporary view, because he wrote in his last comment:
"However, we could optimize this and create a temporary view with the filtered, limited results on the PimItemTable first, then use the view in the additional queries, then destroy the view again".
It seemed to make sense to me, because there is currently no functionality in the QueryBuilder class for nesting a subquery in the main query's FROM clause, unless I missed something.

knauss added a comment. Edited Feb 7 2021, 6:37 PM

@kondzio: yes, that is what Daniel recommended, but a temporary view cannot be optimized.

It seemed to make sense to me, because there is currently no functionality in the QueryBuilder class for nesting a subquery in the main query's FROM clause, unless I missed something.

Well, we need this functionality ;) It shouldn't be that difficult to implement accepting a subquery as a table.

Do you need more help implementing this in QueryBuilder?

I think I can handle it. Thanks for the help.

It makes sense that views don't optimize anything. I was hoping we could avoid scanning the large PimItemTable multiple times and just reuse the list of PimItems in the flags/parts/etc. queries. I guess there's no fast way to do that without stored procedures (and creating a temporary table would be too expensive).

The question is whether a subquery will really be faster than the current state - I'm afraid the SQL optimizers are smart enough to figure out what we are trying to do and will basically end up with the same execution plan.

MariaDB [akonadi]> EXPLAIN SELECT PimItems.id, PartTable.* FROM (SELECT id FROM PimItemTable WHERE collectionId = 24 ORDER BY id DESC) AS PimItems INNER JOIN PartTable ON (PartTable.pimItemId = pimItems.id);
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+-------------+
| id   | select_type | table        | type | possible_keys                                                 | key                          | key_len | ref                     | rows | Extra       |
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+-------------+
|    1 | SIMPLE      | PimItemTable | ref  | PRIMARY,PimItemTable_collectionIndex,PimItemTable_idSortIndex | PimItemTable_collectionIndex | 9       | const                   | 5478 | Using index |
|    1 | SIMPLE      | PartTable    | ref  | PartTable_pimItemIdTypeIndex,PartTable_pimItemIdSortIndex     | PartTable_pimItemIdTypeIndex | 8       | akonadi.PimItemTable.id | 1    |             |
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+-------------+
2 rows in set (0.000 sec)

MariaDB [akonadi]> EXPLAIN SELECT PimItemTable.id, PartTable.* FROM PimItemTable INNER JOIN PartTable ON (PartTable.pimItemId = PimItemTable.id) WHERE PimItemTable.collectionId = 24 ORDER BY PimItemTable.id DESC;
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+--------------------------+
| id   | select_type | table        | type | possible_keys                                                 | key                          | key_len | ref                     | rows | Extra                    |
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+--------------------------+
|    1 | SIMPLE      | PimItemTable | ref  | PRIMARY,PimItemTable_collectionIndex,PimItemTable_idSortIndex | PimItemTable_collectionIndex | 9       | const                   | 5478 | Using where; Using index |
|    1 | SIMPLE      | PartTable    | ref  | PartTable_pimItemIdTypeIndex,PartTable_pimItemIdSortIndex     | PartTable_pimItemIdTypeIndex | 8       | akonadi.PimItemTable.id | 1    |                          |
+------+-------------+--------------+------+---------------------------------------------------------------+------------------------------+---------+-------------------------+------+--------------------------+
2 rows in set (0.001 sec)

Well, if a limit on PimItem.id comes into the game, we need subqueries anyway.

Planners are smart, but if we can make sure that the working set is small from the beginning, it is always better. The question is whether the planner is smart enough to limit the PimItems before doing the JOIN. If not, the intermediate dataset gets very big, and only afterwards does the WHERE clause reduce the working set. (That's why there is an ongoing debate about whether you should use a plain WHERE instead of a JOIN.) When you use a subquery, the planner has to do the reduction of PimItems first in any case. That's why the subquery variant should be preferred: we are dealing with three different database systems with different planners, so a variant that doesn't depend heavily on a good planner is better ;) If the planner is good, then both queries would have the same speed.

You want to reuse the list of PimItems in a second call. Now I get what you mean with the temporary view - why not store the list of ids in a C++ object and then call:

SELECT * FROM PartTable WHERE PartTable.pimItemId in (1,2,3,4,5);

Okay, and if you don't want to store this list inside C++, you can create a temporary table where you store the result of the PimItem query; then you make sure to do that request only once. This reduces execution time, because we only scan the PimItemTable once - but you store more data in the database (okay, in RAM, as temporary tables are destroyed at the end of the session).

I think it depends very much on how big the chunks are that we get via LIMIT, and whether we can later reuse the temp table to return the next chunk of ids. If we can reuse the temp table for the next request, then this will improve the speed, as we do not need to scan through the PimItemTable again. One downside of a long-living temp table: if a new mail arrives, the temp table does not get updated. If we are only talking about ~1000 items, I think storing them in C++ is the better approach, as database systems are optimized for selecting, not for inserting data. On the other hand, temporary tables are made to store intermediate results, so maybe I'm wrong and they are very fast:

CREATE TEMPORARY TABLE pimitems_coll_24 SELECT id FROM PimItemTable WHERE collectionId = 24 ORDER BY id desc;
SELECT * FROM PartTable WHERE PartTable.pimItemId in (SELECT id FROM pimitems_coll_24 LIMIT 100);

In my opinion the approach with the subquery is better. I made a small comparison with a similar query on the "Employees Sample Database" from the MySQL developer site.
I used queries similar to ours, because I don't know how to produce an execution plan with cost estimates on the older MySQL version,
but from what I see the set is limited before doing the JOIN, and the final query cost is much lower than in the second case.

kondzio added a comment. Edited Apr 12 2021, 6:12 PM

Hello Daniel,

I prepared changes in my local repository for this task, where I added an additional constructor to the QueryBuilder class which takes a QSqlQuery object as a subquery in the FROM clause. This solution has one limitation, though: from what I see, it will probably not be possible to add a test to QueryBuilderTest.

I also think that the bound values from the subquery should be inserted directly into the query string, because placeholders in a subquery in the FROM clause may collide with those potentially inserted earlier for aggregate functions or later when building the WHERE conditions.

I tested these changes on MySQL and SQLite and everything works fine. I also checked the fetch statistics in the ItemFetchHelper class, and queries
that have a limit set execute faster, roughly in proportion to the limit.

Could you take a look at the patch with my changes and tell me what you think about it?

Thanks,
Konrad.

It would help if you submitted a merge request on invent.kde.org: https://invent.kde.org/pim/akonadi/-/merge_requests
At first glance the patch seems fine, but we will look at it in more detail once you have submitted a merge request, as that has a much better interface for giving feedback.

Ok, I'll create a new merge request soon. Thanks.