epubextractor: Handle multiple subjects better
AcceptedPublic

Authored by bruns on Feb 20 2018, 6:54 PM.

Details

Summary

Instead of returing one long word of ';'-concatenated subjects,
which is unwrappable return a list of strings and let the consumer handle it.

This will facilitate e.g. dolphin's tooltips display of multiple subjects.

Depends on D12197

Test Plan

unit test
visual inspection in dolphin with adapted baloo-widgets

Diff Detail

Repository
R286 KFileMetaData
Branch
multi-subject
Lint
No Linters Available
Unit
No Unit Test Coverage
michaelh created this revision.Feb 20 2018, 6:54 PM
Restricted Application added a subscriber: Frameworks. · View Herald TranscriptFeb 20 2018, 6:54 PM
michaelh requested review of this revision.Feb 20 2018, 6:54 PM
mgallien added a comment.EditedFeb 20 2018, 7:19 PM

The problem with any changes to the extractors its that the baloo database of the users will not be refreshed.

@mgallien Then they won't benefit from this until a file is reindexed.
Users with baloo disabled will benefit, because in that case baloo_filemetadata_temp_extractor will extract the data on-the-fly.
Reindexing just the ebooks can be done with:

find ~/EBooks/ -type f -name "*.epub" | { mapfile -t; balooctl clear "${MAPFILE[@]}";  balooctl index "${MAPFILE[@]}"; }

@mgallien Then they won't benefit from this until a file is reindexed.
Users with baloo disabled will benefit, because in that case baloo_filemetadata_temp_extractor will extract the data on-the-fly.
Reindexing just the ebooks can be done with:

find ~/EBooks/ -type f -name "*.epub" | { mapfile -t; balooctl clear "${MAPFILE[@]}";  balooctl index "${MAPFILE[@]}"; }

Could you also add a test for the compatibility with current unmodified code ? I would like to be sure that we do not introduce any incompatibility.

I am still not convinced we should change the extractors without an automatic solution for the reindexing case. You will have to ensure everybody can use the property both as a string and a stringlist forever.

The only component I could find to be affected by this change is baloo-widgets. I have already adapted it to this change. And yes, it will handle both.
It will take a some time to publish it because some other stuff has to get reviewed first.
I you know of any component or application using the epub-extractor of KFileMetadata apart from baloo-widgets please let me know.

The only component I could find to be affected by this change is baloo-widgets. I have already adapted it to this change. And yes, it will handle both.
It will take a some time to publish it because some other stuff has to get reviewed first.
I you know of any component or application using the epub-extractor of KFileMetadata apart from baloo-widgets please let me know.

I do understand that you have searched inside KDE hosted code. Still, you are modifying the behavior. To me, it is necessary to have a test that ensures that possible existing clients are not affected by your change. Could you add it ?

I have created T8079 to work on baloo database update when extractors are modified and returned different data.

To me, it is necessary to have a test that ensures that possible existing clients are not affected by your change. Could you add it ?

The new behaviour will break clients. No test needed for this. E.g. baloo-widgets will show the label 'Subject' but no content.
To make it clear: I do not want break this for clients apart from baloo-widgets. The question is: Who are the clients? and How to find them except for searching https://lxr.kde.org/? Please help me with this.

The current behaviour is already breaking: I have some epub files with a lot of dc:subject tags. The semicolon-concatenated string is longer than my monitor is wide. As a result in dolphin the tooltip does not display at all. My rationale was to tackle the problem at the root rather than to work around it by splitting the string within baloo-widgets. With regard to T8079 it will be the same. If data is taken from baloo's database that old long string will be displayed

According to the standard an epub can have multiple dc:subject entries. So it is reasonable to expect a string list from KFileMetaData. Plus the possibilities are grand with an KFileMetaData writer for epubs one could use the tag widget to change them. One could unify xattr tags and epub subjects in tags:// protocol and mayby more.

mgallien requested changes to this revision.Feb 27 2018, 8:47 PM

To me, it is necessary to have a test that ensures that possible existing clients are not affected by your change. Could you add it ?

The new behaviour will break clients. No test needed for this. E.g. baloo-widgets will show the label 'Subject' but no content.
To make it clear: I do not want break this for clients apart from baloo-widgets. The question is: Who are the clients? and How to find them except for searching https://lxr.kde.org/? Please help me with this.

Yesterday, I tried looking for such use and did not found them. I did try to use web search engine with no success. I am afraid I cannot help you but would be happy to hear any advice.

The current behaviour is already breaking: I have some epub files with a lot of dc:subject tags. The semicolon-concatenated string is longer than my monitor is wide. As a result in dolphin the tooltip does not display at all. My rationale was to tackle the problem at the root rather than to work around it by splitting the string within baloo-widgets. With regard to T8079 it will be the same. If data is taken from baloo's database that old long string will be displayed

I am sure that your work is really trying to fix real problem that are seeing. You have my full support on your aim.

I should apologize given how bad my knowledge of KFileMetaData is. It is already silently managing single and multiple entries. In fact, yes, the epub extractor is doing it wrong.
I have also checked how baloo is handling multiple entries and it works in the same way than pure kfilemetadata.

In the taglib extractors multiple entries are added in several call to add method. This is nicely handled by KFileMetaData::SimpleExtractionResult::add . It would be nice if you modify your patch to follow the same approach. In Baloo, things are a little bit different and a list is automatically created when multiple add are called with the same property but the result is similar (in baloo Result::add ).

You can forget my initial objection, I was plain wrong.

We may have bugs in user of the API not expecting a list when they should have. In my opinion, this should not block your diff. Let's fix them when we discover them.

According to the standard an epub can have multiple dc:subject entries. So it is reasonable to expect a string list from KFileMetaData. Plus the possibilities are grand with an KFileMetaData writer for epubs one could use the tag widget to change them. One could unify xattr tags and epub subjects in tags:// protocol and mayby more.

This revision now requires changes to proceed.Feb 27 2018, 8:47 PM

@mgallien : taglibextractor.cpp is very hard to read. Hopefully it is no longer.

src/extractors/epubextractor.cpp
99

Did you mean something like this?:

const auto& values = fetchMetadata(ePubDoc, EPUB_SUBJECT);
for (auto& value : values) {
        result->add(Property::Subject, value);
 }

It fails the test (of course). It it unclear to me how to handle result.properties().value(Property::Subject) and there is no example in the tests. They all compare against QVariant::Type::String. Do you have an example from elisa?

Sorry for being so late.

@mgallien : taglibextractor.cpp is very hard to read. Hopefully it is no longer.

src/extractors/epubextractor.cpp
99

Elisa code is not properly handling this case. When writing it, I had not understood that. Thanks to your diff, I have now properly read the code and I should go modify Elisa code.

In general, you can use canConvert<the type>() from QVariant. This should allow to handle both cases.
You can probably always convert to a string list QVariant::toStringList. It should be enough for properties with one string or a list of strings.

This is bad! I have learned baloo itself doesn't handle stringlists. Which in my view would be the natural way to handle token-like items like tags, keywords and subject(s). Until this is changed in baloo I'm putting this diff on hold and fro the time being work around these long subject-strings within baloo-widgets.

@mgallien: Did I understand you correctly, the refactoring helped you to understand it better? That would be an incentive for me to do more refactoring.

This is bad! I have learned baloo itself doesn't handle stringlists. Which in my view would be the natural way to handle token-like items like tags, keywords and subject(s). Until this is changed in baloo I'm putting this diff on hold and fro the time being work around these long subject-strings within baloo-widgets.

I believe this is quite the opposite. I am already getting strings list for some audio metadata. I would like to get your patch in given you make it works like already existing code and apart from the baloo-widgets code everybody should be fine with your modifications.

@mgallien: Did I understand you correctly, the refactoring helped you to understand it better? That would be an incentive for me to do more refactoring.

This is this patch that made me read the code in baloo that store the properties returned by the extractors. This code is handling strings list quite fine and automatically when a property is added several times.

I believe this is quite the opposite. I am already getting strings list for some audio metadata.

That would be great. Please point me to the respective code in elisa. I want to know which properties and how baloo is involved.
Maybe I gave up too soon. With the old kfilemetadata baloosearch subject:Dummy1 has a result. When kfilemetadata delivers subject as string list is does not. I will reinvestigate.

I would like to get your patch in given you make it works like already existing code and apart from the baloo-widgets code everybody should be fine with your modifications.

Please rephrase. It looks as if some words were deleted. Anyway, baloo-widgets is no problem at all. It's easy to adapt and to debug. It is baloo that's giving me headaches.

This is bad! I have learned baloo itself doesn't handle stringlists. Which in my view would be the natural way to handle token-like items like tags, keywords and subject(s). Until this is changed in baloo I'm putting this diff on hold and fro the time being work around these long subject-strings within baloo-widgets.

I believe this is quite the opposite. I am already getting strings list for some audio metadata. I would like to get your patch in given you make it works like already existing code and apart from the baloo-widgets code everybody should be fine with your modifications.

@mgallien: Did I understand you correctly, the refactoring helped you to understand it better? That would be an incentive for me to do more refactoring.

This is this patch that made me read the code in baloo that store the properties returned by the extractors. This code is handling strings list quite fine and automatically when a property is added several times.

Could you please point me to the code location? I would like to investigate how the code is used for D10803.
Also, since this discussion also applies for the taglibextractor: Is a string list preferred for new properties in KFileMetadata when multiple entries are possible? I think right now most of the properties are using a single concatenated string.

@mgallien: I think we have a beautiful misunderstanding here :-)

This is bad! ...

I was referring to baloo inablility to handle string lists and not to this diff.

I believe this is quite the opposite. ...

I took your answer the wrong way completely. Let's rewind our conversation a little

@mgallien: I think we have a beautiful misunderstanding here :-)

This is bad! ...

I was referring to baloo inablility to handle string lists and not to this diff.

I believe this is quite the opposite. ...

I took your answer the wrong way completely. Let's rewind our conversation a little

No problem from side. In fact this is quite the opposite given you made me better understand all this code. I have now to fix Elisa code to handle lists where I was expecting a single string.

Could you please update your diff and we can land it ? This is a useful improvement.

Could you please update your diff and we can land it ? This is a useful improvement.

  1. We can't land it yet. baloo searching breaks with this patch. baloo has be adapted first.
  2. I don't know what to update. Please tell me once more what you want me to change.

For this change concerted actions are needed. Let's discuss general questions here: T8196

Could you please update your diff and we can land it ? This is a useful improvement.

  1. We can't land it yet. baloo searching breaks with this patch. baloo has be adapted first.

I trust you on this one but on some property the taglib extractor is already doing multiple adds of the same property. It means Baloo is already storing lists.
I can help you but have not much time as usual.

  1. I don't know what to update. Please tell me once more what you want me to change.

You can have a look at the taglib extractor. There are multiple examples of for loops adding multiple times the same property.

For this change concerted actions are needed. Let's discuss general questions here: T8196

Thanks, this is a good idea.

michaelh added a comment.EditedMar 13 2018, 9:32 AM

Ahhhh.... Either I'm completely off-track here or kfilemetadata is not doing this correctly.
I see a lot of statements like artist += ', ' + value -> no list!
result->add calls QMap->addMulti(). This forces the client to iterate over the map, which is not necessary.
In contrary the value type is announced as string list:

kfilemetadata/src/propertyinfo.cpp: 52
     case Property::AlbumArtist:
            d->name = QStringLiteral("albumArtist");
            d->displayName = i18nc("@label", "Album Artist");
            d->valueType = QVariant::StringList;
            break;

The client can call toVariantMap() on the extraction result to get a QVariantMap but not a QVariant::StringList.
I really, really, really do not understand why kfilemetadata is not delivering a string list. It is much more natural and consistent with the announced Property info.
Also:

kfilemetadata/src/properties.h: 282
typedef QMap<Property::Property, QVariant> PropertyMap;
//In taglibextractor never a list --^`

I'm confused.

michaelh updated this revision to Diff 30858.Mar 29 2018, 4:08 PM
michaelh edited the summary of this revision. (Show Details)
  • Add multiple properies to map

Finally I understood how those multiple add()s were meant and it makes more sense now.
We're still not good to go because of this:

$ balooctl index test.epub
Indexing ~/frameworks/kfilemetadata/autotests/samplefiles/test.epub
File(s) indexed

$ balooshow test.epub
177442497910278149 2053 41314051 ~/frameworks/kfilemetadata/autotests/samplefiles/test.epub
        Author: Happy Man
        Title: The Big Brown Bear
        Subject: Baloo;KFile;Meta;Data
        Publisher: Happy Publisher
        Creation Date: 2014-01-01T01:01:01Z

$ baloo_filemetadata_temp_extractor test.epub
QMap(("author", QVariant(QString, "Happy Man"))("creationDate", QVariant(QDateTime, QDateTime(2014-01-01 01:01:01.000 UTC Qt::TimeSpec(UTC))))("publisher", QVariant(QString, "Happy Publisher"))("subject", QVariant(QString, "Baloo;KFile;Meta;Data"))("title", QVariant(QString, "The Big Brown Bear")))
AAAABQAAAA ...

Those semicolons in Baloo;KFile;Meta;Data should not be there and I have no idea where they come from.

The semicolons problem is fixed now. See T8196

michaelh updated this revision to Diff 30972.Mar 30 2018, 8:13 PM
  • Additionally use keywords property

This way the value-type of subject can be preserved

bruns added a subscriber: bruns.Mar 31 2018, 12:56 AM
michaelh updated this revision to Diff 32124.EditedApr 14 2018, 4:15 PM
michaelh edited the summary of this revision. (Show Details)

Remove tests from epubextractortest use multivaluetest instead
Make it depend on D12197

bruns added a comment.Apr 15 2018, 2:59 AM

Sorry to join late here ...

A QMap can store multiple values for one key, and a client reading the Map can use QMap::values() to get a list of all matching properties. If a client naively uses value() instead, it just gets the first value for the key, but so be it.

mgallien accepted this revision.Apr 18 2018, 7:52 PM

Thanks

This revision is now accepted and ready to land.Apr 18 2018, 7:52 PM
bruns added inline comments.Apr 19 2018, 4:07 PM
src/extractors/epubextractor.cpp
85

I think we should add each title individually (there may be one per language).

Dito for all other properties, see below.

97

This looks a little bit inconsistent - if we have only one value, we add the value to two different Properties, otherwise only Keywords is used.

DublinCore uses subject and keywords synonymously. It recommends to use one property entry per keyword.

See Recommendation 5 of http://dublincore.org/documents/dc-xml-guidelines/
and e.g. http://dublincore.org/groups/collections/collection-application-profile/#coldcsubject

I think we should stick with either Subject or Keywords, but not both.

michaelh added inline comments.Apr 19 2018, 6:15 PM
src/extractors/epubextractor.cpp
85

I think we should port away from libepub. Multiple titles result in one ';'-joined string.
Also it seems to be unmaintained.

97

Right, this inconsistency is intentional, and it needs discussion. That's why I added a comment in D12197 which was probably overlooked.
DC and IDPF aren't very clear on how to use dc:subject. Calibre interprets it as tags, My impression is, that most provider also do. Hence I prefer to use Property::Keywords only because it comes closest imo. That change would not really be breaking as currently Property::Subject is one large string joined with ';'.

bruns added inline comments.Apr 19 2018, 7:24 PM
src/extractors/epubextractor.cpp
85

The joined titles is the fault of this epubextractor AFAICS - see fetchMetadataString

97

Distinction between Subject and Keywords typically is keywords are just a bunch of words without further specification, while subject, as specified by DC, and as used by e.g. libraries, are taken from filed specific catalogs.

Baloos properties documentation specifically mentions dc:subject for Properties:Subject.

One of the file formats which has both, keywords and subject, is ODF, which uses dc:subject and meta:keywords.

DC specifies for any property:

Recommendation 5. Multiple property values should be encoded by repeating the XML element for that property.

My opinion is to always use Properties::Subject for dc:subject (as documented for baloo), and add each property instance individually. If properties are already messed up in the originating document, there is nothing we can do, but we should not make things worse.

bruns requested changes to this revision.May 9 2018, 9:49 PM

Discussion regarding subject vs keywords is still pending, i.e. no conclusion.

This revision now requires changes to proceed.May 9 2018, 9:49 PM
Restricted Application edited subscribers, added: Baloo, kde-frameworks-devel; removed: Frameworks. · View Herald TranscriptMay 9 2018, 9:49 PM
bruns commandeered this revision.Dec 5 2018, 8:44 PM
bruns edited reviewers, added: michaelh; removed: bruns.

Picking this up ...

This revision is now accepted and ready to land.Dec 5 2018, 8:44 PM