Differential D19109

[Extractor] Add metadata to extractors
ClosedPublic

Authored by bruns on Feb 18 2019, 12:05 AM.

Details

Reviewers

ngraham
astippich
poboiko

Group Reviewers

Baloo
Frameworks

Commits

R286:de81ddb651b1: [Extractor] Add metadata to extractors

Summary

This adds extractor metadata in a backwards and forward compatible way.

There are several use cases for this metadata:

Delayed loading of extractor plugins - currently, all extractors are loaded and initialized when an ExtractorCollection is created.
Versioning information - e.g. Baloo would benefit from versioning information, to reindex affected files after an extractor has been updated.

Although it would be possible to extend the extractor plugin interface
with a method for each relevant property, it would require a bump of
the plugin interface version each time the interface is extended.

CCBUG: 404171
See: T9867, T8079

Test Plan

ctest

Diff Detail

Repository

R286 KFileMetaData

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision. Feb 18 2019, 12:05 AM

Restricted Application added projects: Frameworks, Baloo. Feb 18 2019, 12:05 AM

Restricted Application added a subscriber: kde-frameworks-devel.

bruns requested review of this revision. Feb 18 2019, 12:05 AM

Harbormaster completed remote builds in B8459: Diff 51936. Feb 18 2019, 12:05 AM

bruns edited the summary of this revision. Feb 18 2019, 12:10 AM

bruns retitled this revision from "[Extractor] Add metadata to properties" to "[Extractor] Add metadata to extractors". Feb 18 2019, 1:17 AM

A few general remarks:

I really do not like that there are two lists of supported mimetypes now which have to be kept in sync
Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarely affects only one mimetype. It also makes the list hard to maintain, particularly for file types which have multiple mimetypes, e.g. audio/wav and audio/x-wav
Do we need an x.y version? I think a single integer is enough or what do you have in mind?
I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

In D19109#414758, @astippich wrote:

A few general remarks:

I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarely affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

"fix ape disc number extraction"
"implement more tags for asf metadata"
...

I want to reduce reindexing as much as possible.

Also, this makes the list hard to maintain, also regarding file types which have multiple mime types, e.g. audio/wav and audio/x-wav
Do we need an x.y version? I think a single integer is enough or what do you have in mind?

Changes only affecting failed files are minor versions, changes affecting already indexed files (i.e. support for new properties) get a new major version.
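The major/minor rule above can be sketched as a reindex decision. This is only an illustration under stated assumptions: `ExtractorVersion`, `IndexState` and `needsReindex` are hypothetical names, not part of KFileMetaData or Baloo, and the sketch assumes the indexer stores the version that was current when a file was last processed.

```cpp
// Hypothetical per-mimetype extractor version (not an existing KFileMetaData type).
struct ExtractorVersion {
    int majorVersion;
    int minorVersion;
};

// Hypothetical outcome of the last indexing attempt for a file.
enum class IndexState { Indexed, Failed };

// Decide whether a file should be handed to the extractor again.
// - A major bump may add data for already indexed files: reindex regardless of state.
// - A minor bump only changes the result for files that failed before.
// - Anything else (same or older version) needs no work.
bool needsReindex(ExtractorVersion stored, ExtractorVersion current, IndexState state)
{
    if (current.majorVersion > stored.majorVersion) {
        return true;
    }
    if (current.majorVersion == stored.majorVersion
        && current.minorVersion > stored.minorVersion) {
        return state == IndexState::Failed;
    }
    return false;
}
```

With this split, a minor bump leaves successfully indexed files alone, which is the reduction in reindexing the scheme is meant to achieve.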

I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

That requires changing the plugin interface. It also does not allow querying extractor properties without fully loading the plugin (which is expensive). Read https://vizzzion.org/blog/2013/08/ "K_PLUGIN_FACTORY_WITH_JSON or where is the metadata?"
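The point about querying properties without loading the plugin can be illustrated with a small standard-library sketch. It does not use the Qt plugin loader; `Plugin`, `LazyExtractor` and the loader callback are hypothetical stand-ins, and the metadata map plays the role of the data that would otherwise come from the plugin's JSON file.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a fully loaded extractor plugin.
struct Plugin {
    std::string name;
};

// Hypothetical entry pairing cheap metadata with a deferred, expensive load.
class LazyExtractor {
public:
    LazyExtractor(std::map<std::string, std::string> mimeVersions,
                  std::function<std::unique_ptr<Plugin>()> loader)
        : m_mimeVersions(std::move(mimeVersions)), m_loader(std::move(loader)) {}

    // Answered from metadata alone; never triggers the loader.
    bool supports(const std::string &mimetype) const {
        return m_mimeVersions.count(mimetype) > 0;
    }

    bool isLoaded() const { return static_cast<bool>(m_plugin); }

    // Loads the plugin on first use and caches it.
    // Assumes the loader returns a valid instance.
    Plugin &plugin() {
        if (!m_plugin) {
            m_plugin = m_loader();
        }
        return *m_plugin;
    }

private:
    std::map<std::string, std::string> m_mimeVersions; // mimetype -> version string
    std::function<std::unique_ptr<Plugin>()> m_loader;
    std::unique_ptr<Plugin> m_plugin;
};
```

A collection built from such entries could answer "which extractor handles this mimetype?" for every plugin while paying the load cost only for the ones that are actually used.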

In D19109#414968, @bruns wrote:

In D19109#414758, @astippich wrote:

A few general remarks:

I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

My fear is that it is easily forgotten, but I did not see the autotest. Still, do you think it is feasible to generate the mimetype stringlist from the JSON data to remove the duplication?

Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarely affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

"fix ape disc number extraction"

"implement more tags for asf metadata"

...

I want to reduce reindexing as much as possible.

And I can give you examples where this was not the case :). This is also only the case because TagLibExtractor was stupidly written (which D18826 fixes). The other extractors do not have that many special code paths.
Well, I find it cumbersome to implement this fine-grained control, but otherwise people will probably yell because of high cpu usage...
At least, I would like to group duplicated mimetypes such as audio/wav and audio/x-wav, but that is not possible with JSON, is it?

Also, this makes the list hard to maintain, also regarding file types which have multiple mime types, e.g. audio/wav and audio/x-wav
Do we need an x.y version? I think a single integer is enough or what do you have in mind?
Changes only affecting failed files are minor versions, changes affecting already indexed files (i.e. support for new properties) get a new major version.

I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

That requires changing the plugin interface. It also does not allow querying extractor properties without fully loading the plugin (which is expensive). Read https://vizzzion.org/blog/2013/08/ "K_PLUGIN_FACTORY_WITH_JSON or where is the metadata?"

Thanks.

In D19109#415710, @astippich wrote:

In D19109#414968, @bruns wrote:

In D19109#414758, @astippich wrote:

A few general remarks:

I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

My fear is that it is easily forgotten, but I did not see the autotest. Still, do you think it is feasible to generate the mimetype stringlist from the JSON data to remove the duplication?

The two lists are not complete duplicates - e.g. the officeextractor (pre-2007) uses runtime detection of some binary helpers. If these are not found, the list returned by the plugin is empty. The plugin has no direct access to its metadata, as it is only available from the loader and there is no possibility to pass it back, so it cannot default to it.

Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarely affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

"fix ape disc number extraction"

"implement more tags for asf metadata"

...

I want to reduce reindexing as much as possible.

And I can give you examples where this was not the case :).

... which does not prohibit bumping the version for all affected encoders. Also, nothing prevents skipping versions, e.g. if "foo/bar" is 2.1, and "foo/baz" is 1.3, and both get a major bump, both can be set to 3.0.
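The "foo/bar" and "foo/baz" example can be written out as a short sketch. `ExtractorVersion` and `bumpMajorTogether` are hypothetical names used only for illustration; the actual change keeps versions in the JSON metadata and has no such helper.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-mimetype version record.
struct ExtractorVersion {
    int majorVersion;
    int minorVersion;
};

// Give all listed mimetypes the same new major version, one above the highest
// major among them. Mimetypes with a lower major simply skip the versions in
// between. Mimetypes missing from the table are ignored.
void bumpMajorTogether(std::map<std::string, ExtractorVersion> &versions,
                       const std::vector<std::string> &mimetypes)
{
    int highest = 0;
    for (const auto &mimetype : mimetypes) {
        const auto it = versions.find(mimetype);
        if (it != versions.end()) {
            highest = std::max(highest, it->second.majorVersion);
        }
    }
    for (const auto &mimetype : mimetypes) {
        const auto it = versions.find(mimetype);
        if (it != versions.end()) {
            it->second = ExtractorVersion{highest + 1, 0};
        }
    }
}
```

Starting from "foo/bar" at 2.1 and "foo/baz" at 1.3, one call over both mimetypes leaves both at 3.0, with "foo/baz" skipping 2.x entirely.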

This is also only the case because TagLibExtractor was stupidly written (which D18826 fixes). The other extractors do not have that many special code paths.
Well, I find it cumbersome to implement this fine-grained control, but otherwise people will probably yell because of high cpu usage...
At least, I would like to group duplicated mimetypes such as audio/wav and audio/x-wav, but that is not possible with JSON, is it?

You can reorder any aliasing mimetypes.

Another question is, why do we have "audio/wav" and "audio/x-wav" in the first place? Are there really files where one type is reported for one file, and the other for other files? Wouldn't it be better to just have the canonical type? At least on my computer, shared-mime-info only has audio/x-wav, listing audio/wav and audio/vnd.wave as aliases. Aliases should never be returned by QMimeDatabase.

I would also like to remove the aliasing mimetypes. But I guess due to the implementation where the mimetype is given as QString, and there is no guarantee that it is obtained from QMimeType, it should handle them.
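One way to handle callers that pass an alias as a plain string is to normalize the name before looking up an extractor, so the metadata only needs the canonical type. The sketch below is an assumption-laden illustration, not the code from this revision: `aliasTable` and `canonicalMimetype` are hypothetical, and the table is hard-coded from the shared-mime-info example mentioned above rather than read from the MIME database.

```cpp
#include <map>
#include <string>

// Hypothetical alias table; the entries mirror the audio/x-wav example from
// the discussion. A real implementation would query the MIME database instead.
const std::map<std::string, std::string> &aliasTable()
{
    static const std::map<std::string, std::string> table = {
        {"audio/wav", "audio/x-wav"},
        {"audio/vnd.wave", "audio/x-wav"},
    };
    return table;
}

// Map an alias to its canonical name; names without an entry are returned unchanged.
std::string canonicalMimetype(const std::string &mimetype)
{
    const auto it = aliasTable().find(mimetype);
    return it == aliasTable().end() ? mimetype : it->second;
}
```

With such a step in front of the lookup, the per-mimetype version list would need a single entry for audio/x-wav instead of one per alias.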

This revision is now accepted and ready to land. Feb 23 2019, 3:10 PM

also T8079

bruns edited the summary of this revision. Feb 23 2019, 7:12 PM

add AppImage extractor metadata

Harbormaster completed remote builds in B8749: Diff 52402. Feb 23 2019, 8:35 PM

Closed by commit R286:de81ddb651b1: [Extractor] Add metadata to extractors (authored by bruns). Feb 23 2019, 8:35 PM

This revision was automatically updated to reflect the committed changes.

mgallien mentioned this in D20502: Check for string lists and multi-values in property map. Apr 22 2019, 8:56 PM

poboiko mentioned this in D23787: [baloo_file_extractor] Improve handling of large plain-text files. Sep 8 2019, 12:05 PM

Revision Contents
Changeset List

		Path
M		autotests/CMakeLists.txt (1 line)
M		autotests/extractorcollectiontest.cpp (59 lines)
M		src/extractor.h (4 lines)
M		src/extractor.cpp (10 lines)
M		src/extractor_p.h (1 line)
M		src/extractorcollection.h (4 lines)
M		src/extractorcollection.cpp (11 lines)
M		src/extractors/CMakeLists.txt (2 lines)
M		src/extractors/appimageextractor.h (3 lines)
A	M	src/extractors/appimageextractor.json (9 lines)
M		src/extractors/epubextractor.h (3 lines)
A	M	src/extractors/epubextractor.json (8 lines)
M		src/extractors/exiv2extractor.h (3 lines)
A	M	src/extractors/exiv2extractor.json.in (29 lines)
M		src/extractors/ffmpegextractor.h (3 lines)
A	M	src/extractors/ffmpegextractor.json (16 lines)
M		src/extractors/mobiextractor.h (3 lines)
A	M	src/extractors/mobiextractor.json (8 lines)
M		src/extractors/odfextractor.h (3 lines)
A	M	src/extractors/odfextractor.json (10 lines)
M		src/extractors/office2007extractor.h (3 lines)
A	M	src/extractors/office2007extractor.json (10 lines)
M		src/extractors/officeextractor.h (3 lines)
A	M	src/extractors/officeextractor.json (19 lines)
M		src/extractors/plaintextextractor.h (3 lines)
A	M	src/extractors/plaintextextractor.json (8 lines)
M		src/extractors/poextractor.h (3 lines)
A	M	src/extractors/poextractor.json (8 lines)
M		src/extractors/popplerextractor.h (3 lines)
A	M	src/extractors/popplerextractor.json (8 lines)
M		src/extractors/postscriptdscextractor.h (3 lines)
A	M	src/extractors/postscriptdscextractor.json (9 lines)
M		src/extractors/taglibextractor.h (3 lines)
A	M	src/extractors/taglibextractor.json (25 lines)
M		src/extractors/xmlextractor.h (3 lines)
A	M	src/extractors/xmlextractor.json (10 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	51936	4932727		Feb 18 2019, 12:05 AM	★	★
Diff 2	52402	1aa7f91	add AppImage extractor metadata	Feb 23 2019, 8:35 PM	★	★
Diff 3	52403	24359a0	R286:de81ddb651b14ca567e30c5bca4f7618894819a5	Feb 23 2019, 8:35 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
50dfd2d5c00f	3a59a949ca12	1aa7f9168d51	Stefan Brüns	[Extractor] Add metadata to extractors (Show More…)	Nov 4 2018, 1:28 AM
