[baloo_file_extractor] Improve handling of large plain-text files
Needs ReviewPublic

Authored by poboiko on Sep 8 2019, 12:05 PM.

Details

Reviewers
bruns
ngraham
Group Reviewers
Baloo
Summary

Not all plain text-based mimetypes starts with text/:
i.e. application/sql for SQL dumps (already handled in FileExcludeFilters),
or application/postscript for PS images (also handled by PostscriptDSCExtractor).
However, there are most likely to be more.

This patch aims at handling the issue in more generic way:
it just skips PlaintextExtractor for any large file, utilizing extractor metadata introduced in D19109: [Extractor] Add metadata to extractors.

Alternative solution would be using QMimeType::inherits instead.

Test Plan
  1. Create large .txt file (>10Mb)
  2. baloo_file_extractor still skips it.

Diff Detail

Repository
R293 Baloo
Branch
improve-large-text-files (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17328
Build 17346: arc lint + arc unit
poboiko created this revision.Sep 8 2019, 12:05 PM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptSep 8 2019, 12:05 PM
poboiko requested review of this revision.Sep 8 2019, 12:05 PM
broulik added inline comments.
src/file/extractor/app.cpp
184

Store the size in a variable outside the loop, otherwise you end up querying it on each iteration.

poboiko updated this revision to Diff 66805.Sep 25 2019, 10:41 AM

Address raised issue: fetch file size only once

poboiko marked an inline comment as done.Sep 25 2019, 10:42 AM
ngraham accepted this revision.Sep 25 2019, 11:49 PM
This revision is now accepted and ready to land.Sep 25 2019, 11:49 PM
bruns requested changes to this revision.Sep 26 2019, 12:21 AM

Can you please provide an example which:

  • is currently indexed though it should be skipped due to size
  • is skipped after this change

and another example which:

  • is currently skipped though it should be indexed
  • is indexed after this change
src/file/extractor/app.cpp
184

You should compare for size first, as that's much cheaper than fetching the property and comparing the string.

185

Users will love us for spammig the logs ...

This revision now requires changes to proceed.Sep 26 2019, 12:21 AM

Can you please provide an example which:

  • is currently indexed though it should be skipped due to size
  • is skipped after this change

Sure. Any mimetype inherited from "text/plain", but starting with "text/" counts. I've made an actual list:


(using simple python script, which iterates over QMimeDatabase().allMimeTypes(), checks if type.inherits("text/plain") and is not already excluded by default Baloo config from file/fileexcludefilters.cpp)

By looking at list, I see that some of them might be pretty heavy (and useless to index). For example, application/x-valgrind-massif, or application/sql (I know, SQL dumps are excluded by extension *.sql, but someone might simply use another extension like .dump). It's also pretty easy to imagine large Wolfram Mathematica file, i.e. containing pictures (that corresponds to application/mathematica from the list; although on my computer those are detected as application/vnd.wolfram.nb, which for some reason do not inherit text/plain, although it's plaintext-based).

We can do our best to exclude undesired types, but I'm not sure we will be able to cover all of them. And some files might be of desirable type, but simply too large (RSS feeds application/rss+xml, LyX files for some books application/x-lyx, mailboxes message/rfc822 or application/mbox).

and another example which:

  • is currently skipped though it should be indexed
  • is indexed after this change

There shouldn't be any. I mean, "PlaintextExtractor" should be inside exList for anything that starts with text/...

poboiko added inline comments.Oct 4 2019, 12:29 PM
src/file/extractor/app.cpp
185

I though users might actually want to know if file was excluded (and the reasoning behind it).
I can make its severity less, i.e. qCDebug. Or you think it should be completely removed?

ngraham added inline comments.Oct 4 2019, 12:32 PM
src/file/extractor/app.cpp
185

Users don't read log files. :)

Then again if the default log level is error rather than warning, then this isn't actually a logspam problem either.

davidedmundson added inline comments.
src/file/extractor/app.cpp
173

This original line seemed very very wrong.

Just because we won't want to index phase 2 isn't a reason to remove the filename indexing - it'll just keep running phase 1 on itself again and again.

So +1 on that

poboiko updated this revision to Diff 67318.Oct 4 2019, 12:46 PM

Minor optimization: change check order (filesize / extractor property)

Also, it should be better check by "Id" instead of "Name"

poboiko marked an inline comment as done.Oct 4 2019, 12:46 PM

and another example which:

  • is currently skipped though it should be indexed
  • is indexed after this change

There shouldn't be any. I mean, "PlaintextExtractor" should be inside exList for anything that starts with text/...

Then whats the "Secondly ..." paragraph in the summary about?

src/file/extractor/app.cpp
173

Off by one error - phase one is content indexing already, phase zero is filename/filestat.

Can you please provide an example which:

  • is currently indexed though it should be skipped due to size
  • is skipped after this change

Sure. Any mimetype inherited from "text/plain", but starting with "text/" counts. I've made an actual list:


(using simple python script, which iterates over QMimeDatabase().allMimeTypes(), checks if type.inherits("text/plain") and is not already excluded by default Baloo config from file/fileexcludefilters.cpp)

Your script is wrong. E.g. SVG inherits from text/plain, but has its own extractor, thus is not fed to the PlaintextExtractor. Dito for anything inheriting from XML.

@bruns: I've missed D16593: [ExtractorCollection] Use only best matching extractor plugin, and had in mind previous situation where we've matched all extractors based on inheritance. In that case, "Secondly" part indeed does not seem to apply anymore.
(as for my previous answer: I misunderstood you, thought you were asking about the case where PlainTextExtractor did not match & matched afterwards)

Your script is wrong. E.g. SVG inherits from text/plain, but has its own extractor, thus is not fed to the PlaintextExtractor. Dito for anything inheriting from XML.

I'm not claiming the list to be comprehensive, it's just a first approximation.
I'm claiming just that there is plethora of plain-text-based types (and might be even more in the future), some of which in principle might cause an issue.

There were plenty of situations in the past when users first encountered Baloo choking on some files (see git log of fileexcludefilters.cpp - SQL dumps, genome data, etc.), which made Baloo unusable for them.
Luckily for us, they reported it, and we blacklisted it. But I think it's unlikely we will manage to cover all the problematic cases that way (not all users report issues, and we're not familiar with all possible mimetypes).
This patch should serve as a preventive measure, reducing the probabilty of Baloo choking on it in the first place.

poboiko edited the summary of this revision. (Show Details)Nov 13 2019, 1:11 PM