Add standalone conversion functions for PropertyMap to Json and vice versa
ClosedPublic

Authored by bruns on Feb 17 2019, 1:09 AM.

Details

Summary

Baloo caches property data, and uses Json for serializing it. For storing,
it populates a QVariantMap, where multiple values for the same key are
stored as a QVariantList (QJsonObject::fromVariantMap only handles one
value per key). After deserialization, the KFileMetaData::PropertyMap
also contains QVariantLists for keys with multiple values.

To handle multiple property values per key correctly, do the serialization
and deserialization manually. This also save the temporary QVariantMap
on deserialization.

Add unit tests to test roundtrip from PropertyMap to JSon and back. This
is now possible as the serialization code is no longer done inline.

It makes it also easier to keep serialization and deserialization in
sync, as it is no longer in different files (src/file/result.cpp and
src/file/extractor/file.cpp).

Test Plan

ctest

Diff Detail

Repository
R293 Baloo
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bruns created this revision.Feb 17 2019, 1:09 AM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptFeb 17 2019, 1:09 AM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
bruns requested review of this revision.Feb 17 2019, 1:09 AM

I was wondering if using QDataStream would be simpler, but the data size was over twice the amount compared to the size using JSON during my testing. But this would solve a lot of the limitations which JSON imposes, like handling int and doubles, qdatetime objects, and there would be no need for merging/unmerging items after fixing KFileMetaData.

autotests/unit/file/propertyserializationtest.cpp
3

very minor nitpick: missing *
also below

103

A test with stringlists and single strings would be nice. Also doubles should be checked.

src/file/propertydata.cpp
81

This means that multiple entries with the same key get stored as single entries in the property map, right?
This breaks the current behavior in dolphin and baloo-widgets. Merging (or more specific D19088) must then wait until we convert all those data to output stringlists. But I think this code will also split up data of stringlists, doesn't it?
Also, if there are multiple extractors for the same mime type (which I know there aren not currently) , which will add e.g. two releaseYear properties, these will be converted to a string.

84

I think that will break all properties containing doubles, of which there are a few.

bruns updated this revision to Diff 51928.Feb 17 2019, 8:41 PM
bruns marked an inline comment as done.

Add double test case
Fix double conversion
Fix nitpicks ('*')

bruns marked an inline comment as done.Feb 17 2019, 8:43 PM
astippich requested changes to this revision.Feb 19 2019, 8:51 PM

The stringlist case is still not handled. And a test case for multiple ints (2 ReleaseYear properties) will also likely fail. When going with JSON, I think this has to be done by always inserting values as an array (even if there is only one value), and removing this during deserialization. Stringlist will then be an array in an array in JSON. That will ensure that the map is of the same structure as it was put in, and this should also be possible to do in a backwards compatible way.

What do you think of using QDataStream here? This would require a db version bump, of course.

src/file/propertydata.cpp
84

Now ints are converted to doubles, which is not fatal, but IMHO also not very clean either. But since JSON only knows doubles, this is what we get. Only way out in this case is to query PropertyInfo::valueType, and do the conversion based on this.

This revision now requires changes to proceed.Feb 19 2019, 8:51 PM

The stringlist case is still not handled.

Thats the reason I have not marked the comment as done ...

And a test case for multiple ints (2 ReleaseYear properties) will also likely fail. When going with JSON, I think this has to be done by always inserting values as an array (even if there is only one value), and removing this during deserialization. Stringlist will then be an array in an array in JSON. That will ensure that the map is of the same structure as it was put in, and this should also be possible to do in a backwards compatible way.

What do you think of using QDataStream here? This would require a db version bump, of course.

Not too much:

Versioning

QDataStream's binary format has evolved since Qt 1.0, and is likely to continue evolving to reflect changes done in Qt. When inputting or outputting complex types, it's very important to make sure that the same version of the stream (version()) is used for reading and writing. If you need both forward and backward compatibility, you can hardcode the version number in the application:

bruns marked an inline comment as done.Feb 20 2019, 11:54 PM
bruns added inline comments.
src/file/propertydata.cpp
81

Actually, I am not sure if we need the stringlists as a separate type.. What is the difference between multiple strings in a stringlist, and multiple strings as individual dict entries?

What is the semantic difference between the two?

84

Thats what also happens in the current code.

astippich added inline comments.Feb 23 2019, 3:23 PM
src/file/propertydata.cpp
81

Please incorporate support for stringlists.
All clients currently rely on having string lists (or variantlists) as output for multiple entries. It is also what is advertised in KFileMetaData.
None of the clients query for multiple entries in the PropertyMap, Dolphin and Baloo-widgets will break without this for multi-value properties. All of them add and format the output on a per property basis, so querying for multiple keys requires significant changes.
IMHO all extractors should only add one property for each key to the map. If there are duplicate entries, this means that multiple extractors added results (with possible duplication).
Having it as string lists also makes the handling of them so much easier. It can be handled with one call to QVariant::toStringList().join().

84

True, and I said the same quite often in other reviews. But you did not let this count for me either :)
Anyway, not a blocker for me.

bruns updated this revision to Diff 52401.Feb 23 2019, 8:26 PM
bruns marked an inline comment as done.

add support for multiple integer values
add multiple integer test

bruns added inline comments.Feb 23 2019, 8:27 PM
src/file/propertydata.cpp
81

All clients currently rely on having string lists (or variantlists)

Actually, string lists make the code more complicated, as the code already has to handle variant lists. There are no QDoublelists, QIntlists ... baloo-widgets treats a QVariantList with string entries exactly like a QStringlist, and dolphin uses baloo-widgets.

It is also what is advertised in KFileMetaData.

KFileMetaData does not specify in any way how values for repeated properties are retuned. It explicitly allows calling ExtractionResult::add() with the same property multiple times, but that's it. PropertyMap is just a QMap<> typedef, and does not specify any behavior. For me, it seems much more natural to expect multiple individual entries for the same key, which should be retrieved with QMap::values().

Currently, baloo-widgets (https://cgit.kde.org/baloo-widgets.git/tree/src/filemetadatawidget.cpp#n152) only uses data[key], i.e. it only retrieves the first value, but fixing this is a single line change (i.e. replace data[key] by QVariant(data.values(key))).

Having it as string lists also makes the handling of them so much easier. It can be handled with one call to QVariant::toStringList().join().

QVariant::toStringList() also works on QVariantLists - https://doc.qt.io/qt-5/qvariant.html#toStringList

84

Only if it can be fixed it a straight-forward and trivial way, or when it is a regression. This is IMHO not the case here.
Also, PropertyInfo::valueType specifies QString for several properties where multiple values are completely reasonable.

I get the feeling that we need to discuss how the PropertyMap should look like and how we achieve a seamless transition in a backwards compatible way.
I have slowly been working to resolve T8196, which is why I am so insisting. This is the last missing piece.
Currently, the clients expect a list in the map when there are multiple values, probably since the inception of baloo. All other entries are simply ignored. If the map is changed to having multiple entries with duplicate keys, then we have to adjust all clients and make sure that all possible combinations of applications and frameworks continue to work.
Additionally, baloo's serialization/deserialization does currently alter the structure of the map. Multiple values with the same key will be merged into a list afterwards. This causes issues in baloo-widgets for example, which breaks for multiple properties when baloo indexing is off. Baloo-widget then uses KFM directly, which results in a different PropertyMap.
Considering this, my plan was that to insert multiple values as single entry containing a list. Once all extractors are converted, the merging of multiple values into lists, currently done before serialization in baloo, will be removed. This yields a backwards compatible way of fixing the issue with multiple entries (That is also what D17302 is for). This demands that the same list is retrieved after deserialization, with no merged values, e.g. additional entries in the list, during the serialization.

So there are a few things that I would like to consider:

  • The same PropertyMap is generated either by baloo or KFM directly. Keeping the PropertyMap the same is only possible to a certain extend with JSON, e.g. with ints becoming doubles etc...
  • Extractors will insert multiple entries as a list.
  • Multiple extractors for the same mime type may insert duplicate entries. This would allow clients to easily differentiate between data from multiple extractors in the future.
src/file/propertydata.cpp
81

Actually, string lists make the code more complicated, as the code already has to handle variant lists. There are no QDoublelists, QIntlists ... baloo- widgets treats a QVariantList with string entries exactly like a QStringlist, and dolphin uses baloo-widgets.

Dolphin has its own handling for the detailed view and does not use baloo-widgets there.

I think here is actually a misunderstanding, by supporting string lists I meant that if one inserts a list, one also gets the same list after deserialization. I actually do not care if it is a string list before and a variant list afterwards, that is how the current code already works.

KFileMetaData does not specify in any way how values for repeated properties are retuned. It explicitly allows calling ExtractionResult::add() with the same property multiple times, but that's it. PropertyMap is just a QMap<> typedef, and does not specify any behavior. For me, it seems much more natural to expect multiple individual entries for the same key, which should be retrieved with QMap::values().

You are right, ExtractionResult explicitly states that multiple calls are allowed. IMHO this should be more specified to "can be called multiple times from different extractors".

Currently, baloo-widgets (https://cgit.kde.org/baloo-widgets.git/tree/src/filemetadatawidget.cpp#n152) only uses data[key], i.e. it only retrieves the first value, but fixing this is a single line change (i.e. replace data[key] by QVariant(data.values(key))).

QVariant::toStringList() also works on QVariantLists - https://doc.qt.io/qt-5/qvariant.html#toStringList

I know, otherwise the current code would not work at all.

84

Which property? I am only aware of Composer, which I plan to change.

bruns updated this revision to Diff 54768.Mar 25 2019, 1:36 PM

add list tests

bruns marked an inline comment as done.Mar 27 2019, 12:11 AM
astippich requested changes to this revision.Mar 30 2019, 8:41 AM

This still alters the map during deserialization. Lists are converted into multi single properties afterwards. As I said, due to backwards-compatibility lists should stay lists.

autotests/unit/file/propertyserializationtest.cpp
56

wrong comment, also everywhere below

131

nitpick: the rating is usually normalized to 0 to 10, so using 20 here is a little bit strange

This revision now requires changes to proceed.Mar 30 2019, 8:41 AM
bruns added a comment.Mar 30 2019, 5:16 PM

This still alters the map during deserialization. Lists are converted into multi single properties afterwards. As I said, due to backwards-compatibility lists should stay lists.

  1. No extractor inserts lists, just single values.
  2. Extractors are allowed and do insert different values for the same property.

-> Lists do not exists as property values. Supporting lists were a new feature.

Both SimpleExtractionResult::properties() (used by the temp extractor) and Baloo::File::properties() return a KFileMetadata::PropertyMap, and it seems only logical both behave likewise. SimpleExtractionResult::properties() already returns the properties one by one, now Baloo::File::properties() does the same.

The current behavior of the Baloo serialization (and thus the deserialization result) is illogical - when you insert a single value, you get back a single non-list QVariant, when inserting multiple values, this is silently converted to a QVariantList. The return type depends on the number of values. The SimpleExtractionResult behavior is consistent - PropertyMap::find() returns the first value, PropertyMap::values()returns all values, independent on the number of values.

Given the choice between an inconsistent and illogical behavior on one hand, and consistent and logical on the other, I definitely prefer the second one.

bruns added a comment.Mar 30 2019, 5:32 PM

On the topic of "multiple values from one extractor" vs "multiple values from different extractors":

I think, when this problem arrives, it should be solved by using one PropertyMap per extractor and explicitly merging these.

  1. For per-extractor unique properties, this is just the union
  2. For repeated property values, just use the value once
  3. For contradicting values, this has to be dealed with on a per-property basis.

This avoids the extra storage in case of duplicated values, keeps the (recurrent!) cost and complexity of merging away from the consumers, and allows better handling or merging, as "dumb" merging of properties already loses information (e.g. where the information originated from).

bruns updated this revision to Diff 55103.Mar 31 2019, 2:28 AM

correct comments

My main concern is that this changes current behavior and as such is not backwards compatible. The rest is a bonus.
That is why my plan was to make the extractors output (string-)lists as outlined in T8196 (this task has been there for quite a while, feedback would have been nice).
This way, current behavior can be preserved while also fixing all bugs with multi-value properties.
If changing this behavior is okay, then you should at least post patches that adapt all clients. I know of Baloo-widgets, Dolphin and Elisa.
But there will be combinations of frameworks and applications that will not work correctly for multi-value properties.
What do others think of this? @mgallien?

astippich added inline comments.Apr 4 2019, 5:07 PM
src/file/propertydata.cpp
89

Can you return arrays as lists here so that current behavior is unchanged, and put the behavior change into another patch so that we can merge this diff?

Are you still interested in working on this? If not, I would like to take over with the following plan:
I would like to propose a transition phase, e.g. we make sure that there is a release available that can handle both formats.
This diff lands with my requested changes.
After the Applications 19.08 release, which can handle both cases, the output format is changed to multiple entries in the property map.
All required patches for the applications have already landed: Elisa (D20502) and Baloo-Widgets (D20739), and Dolphin (D21157)

astippich accepted this revision.Sep 7 2019, 12:38 PM

Since released versions of dolphin, baloo-widgets and elisa now actually cope with this case, I am no longer opposed merging this

This revision is now accepted and ready to land.Sep 7 2019, 12:38 PM
ngraham accepted this revision.Sep 8 2019, 10:24 AM
This revision was automatically updated to reflect the committed changes.