[Extractor] Add metadata to extractors
ClosedPublic

Authored by bruns on Feb 18 2019, 12:05 AM.

Details

Summary

This adds extractor metadata in a backwards and forward compatible way.

There are several use cases for this metadata:

  • Delayed loading of extractor plugins - currently, all extractors are loaded and and initialized when an ExtractorCollection is created.
  • Versioning information - e.g. Baloo would benefit from versioning information, to reindex affected files after an extractor has been updated.

Although it would be possible to extend the extractor plugin interface
with a method for each relevant property, it would require a bump of
the plugin inteface version each time the interface is extended.

CCBUG: 404171
See: T9867, T8079

Test Plan

ctest

Diff Detail

Repository
R286 KFileMetaData
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bruns created this revision.Feb 18 2019, 12:05 AM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptFeb 18 2019, 12:05 AM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
bruns requested review of this revision.Feb 18 2019, 12:05 AM
bruns edited the summary of this revision. (Show Details)Feb 18 2019, 12:10 AM
bruns retitled this revision from [Extractor] Add metadata to properties to [Extractor] Add metadata to extractors.Feb 18 2019, 1:17 AM

A few general remarks:

  • I really do not like that there are two lists of supported mimetypes now which have to be kept in sync
  • Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarily affects only one mimetype. Also, this makes the list hard to maintain, also regarding file types which have multiple mime types, e.g. audio/wav and audio/x-wav
  • Do we need an x.y version? I think a single integer is enough or what do you have in mind?
  • I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

A few general remarks:

  • I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

  • Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarily affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

  • "fix ape disc number extraction"
  • "implement more tags for asf metadata"
  • ...

I want to reduce reindexing as much as possible.

Also, this makes the list hard to maintain, also regarding file types which have multiple mime types, e.g. audio/wav and audio/x-wav
  • Do we need an x.y version? I think a single integer is enough or what do you have in mind?

Changes only affecting failed files are minor versions, changes affecting already indexed files (i.e. support for new properties) get a new major version.

  • I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

Requires changing the plugin interface. Does not allow to query extractor properties without fully loading the plugin (which is expensive). Read https://vizzzion.org/blog/2013/08/ "K_PLUGIN_FACTORY_WITH_JSON or where is the metadata?"

A few general remarks:

  • I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

My fear is that it is easily forgotten, but I did not see the autotest. Still, do you think it is feasible to generate the mimetype stringlist from the JSON data to remove the duplication?

  • Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarily affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

  • "fix ape disc number extraction"
  • "implement more tags for asf metadata"
  • ...

    I want to reduce reindexing as much as possible.

And I can give you examples where this was not the case :). This is also only the case because TagLibExtractor was stupidly written (which D18826 fixes). The other extractors do not have that many special codepath.
Well, I find it cumbersome to implement this fine-grained control, but otherwise people will probably yell because of high cpu usage...
At least, I would like to group duplicated mimetypes such as audio/wav and audio/x-wav, but that is not possible with JSON, is it?

Also, this makes the list hard to maintain, also regarding file types which have multiple mime types, e.g. audio/wav and audio/x-wav
  • Do we need an x.y version? I think a single integer is enough or what do you have in mind?

Changes only affecting failed files are minor versions, changes affecting already indexed files (i.e. support for new properties) get a new major version.

  • I prefer to directly construct the qvariantmap in the extractors, and re-use the mimetype list which is already available.

Requires changing the plugin interface. Does not allow to query extractor properties without fully loading the plugin (which is expensive). Read https://vizzzion.org/blog/2013/08/ "K_PLUGIN_FACTORY_WITH_JSON or where is the metadata?"

Thanks.

bruns added a comment.EditedFeb 19 2019, 10:16 PM

A few general remarks:

  • I really do not like that there are two lists of supported mimetypes now which have to be kept in sync

I think this is trivial enough. Also this is covered by the unit test.

My fear is that it is easily forgotten, but I did not see the autotest. Still, do you think it is feasible to generate the mimetype stringlist from the JSON data to remove the duplication?

These are not completely duplicate - e.g. the officeextractor (pre-2007) uses runtime detection of some binary helpers. If these are not found, the list returned by the plugin is empty. The plugin has no direct access to its metadata, as it is only available from the loader and there is no possibility to pass it back, so it can not default to it.

  • Do we really need versioning per mimetype? IMHO it is sufficient to have a version number per extractor. From my experience, fixing an extractor usually impacts all its supported mimetypes, and rarily affects only one mimetype.

Past experience tells otherwise. There have been feature extensions and bugfixes for specific mimetypes, just look at your own commits

  • "fix ape disc number extraction"
  • "implement more tags for asf metadata"
  • ...

    I want to reduce reindexing as much as possible.

And I can give you examples where this was not the case :).

... which does not prohibit bumping the version for all affected encoders. Also, there is nothing disallowing to skip versions, e.g. if "foo/bar" is 2.1, and "foo/baz" is 1.3, and both get a major bump, both can be set to 3.0.

This is also only the case because TagLibExtractor was stupidly written (which D18826 fixes). The other extractors do not have that many special codepath.
Well, I find it cumbersome to implement this fine-grained control, but otherwise people will probably yell because of high cpu usage...
At least, I would like to group duplicated mimetypes such as audio/wav and audio/x-wav, but that is not possible with JSON, is it?

You can reorder any aliasing mimetypes.

Another question is, why do we have "audio/wav" and "audio/x-wav" in the first place? Are there really files where one type is a reported for one file, and the other for other files? Wouldn't it be better to just have the canonical type? At least on my computer, shared-mime-info only has audio/x-wav, listing audio/wav and audio/vnd.wave as aliases. Aliases should never be returned by QMimeDatabase.

astippich accepted this revision.Feb 23 2019, 3:10 PM

I would also like to remove the aliasing mimetypes. But I guess due to the implementation where the mimetype is given as QString, and there is no guarantee that it is obtained from QMimeType, it should handle them.

This revision is now accepted and ready to land.Feb 23 2019, 3:10 PM
bruns edited the summary of this revision. (Show Details)Feb 23 2019, 7:12 PM
bruns updated this revision to Diff 52402.Feb 23 2019, 8:35 PM

add AppImage extractor metadata

This revision was automatically updated to reflect the committed changes.