[baloo_file_extractor] Improve handling of large plain-text files
Needs ReviewPublic
Actions

Authored by poboiko on Sep 8 2019, 12:05 PM.

Details

Reviewers

bruns
ngraham

Group Reviewers

Baloo

Summary

Not all plain text-based mimetypes starts with text/:
i.e. application/sql for SQL dumps (already handled in FileExcludeFilters),
or application/postscript for PS images (also handled by PostscriptDSCExtractor).
However, there are most likely to be more.

This patch aims at handling the issue in more generic way:
it just skips PlaintextExtractor for any large file, utilizing extractor metadata introduced in D19109: [Extractor] Add metadata to extractors.

Alternative solution would be using QMimeType::inherits instead.

Test Plan

Create large .txt file (>10Mb)
baloo_file_extractor still skips it.

Diff Detail

Repository

R293 Baloo

Branch

improve-large-text-files (branched from master)

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 17328
Build 17346: arc lint + arc unit

poboiko created this revision.Sep 8 2019, 12:05 PM

Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptSep 8 2019, 12:05 PM

poboiko requested review of this revision.Sep 8 2019, 12:05 PM

Harbormaster completed remote builds in B16251: Diff 65646.Sep 8 2019, 12:05 PM

broulik added a subscriber: broulik.Sep 8 2019, 12:07 PM

broulik added inline comments.

src/file/extractor/app.cpp
184	Store the size in a variable outside the loop, otherwise you end up querying it on each iteration.

Address raised issue: fetch file size only once

Harbormaster completed remote builds in B16998: Diff 66805.Sep 25 2019, 10:41 AM

poboiko marked an inline comment as done.Sep 25 2019, 10:42 AM

ngraham accepted this revision.Sep 25 2019, 11:49 PM

This revision is now accepted and ready to land.Sep 25 2019, 11:49 PM

Can you please provide an example which:

is currently indexed though it should be skipped due to size
is skipped after this change

and another example which:

is currently skipped though it should be indexed
is indexed after this change

src/file/extractor/app.cpp
184	You should compare for size first, as that's much cheaper than fetching the property and comparing the string.
185	Users will love us for spammig the logs ...

This revision now requires changes to proceed.Sep 26 2019, 12:21 AM

In D23787#537891, @bruns wrote:

Can you please provide an example which:

is currently indexed though it should be skipped due to size

is skipped after this change

Sure. Any mimetype inherited from "text/plain", but starting with "text/" counts. I've made an actual list:

list.txt3 KBDownload

(using simple python script, which iterates over QMimeDatabase().allMimeTypes(), checks if type.inherits("text/plain") and is not already excluded by default Baloo config from file/fileexcludefilters.cpp)

By looking at list, I see that some of them might be pretty heavy (and useless to index). For example, application/x-valgrind-massif, or application/sql (I know, SQL dumps are excluded by extension *.sql, but someone might simply use another extension like .dump). It's also pretty easy to imagine large Wolfram Mathematica file, i.e. containing pictures (that corresponds to application/mathematica from the list; although on my computer those are detected as application/vnd.wolfram.nb, which for some reason do not inherit text/plain, although it's plaintext-based).

We can do our best to exclude undesired types, but I'm not sure we will be able to cover all of them. And some files might be of desirable type, but simply too large (RSS feeds application/rss+xml, LyX files for some books application/x-lyx, mailboxes message/rfc822 or application/mbox).

and another example which:

is currently skipped though it should be indexed

is indexed after this change

There shouldn't be any. I mean, "PlaintextExtractor" should be inside exList for anything that starts with text/...

poboiko added inline comments.Oct 4 2019, 12:29 PM

src/file/extractor/app.cpp
185	I though users might actually want to know if file was excluded (and the reasoning behind it). I can make its severity less, i.e. `qCDebug`. Or you think it should be completely removed?

ngraham added inline comments.Oct 4 2019, 12:32 PM

src/file/extractor/app.cpp
185	Users don't read log files. :) Then again if the default log level is error rather than warning, then this isn't actually a logspam problem either.

davidedmundson added a subscriber: davidedmundson.Oct 4 2019, 12:41 PM

davidedmundson added inline comments.

src/file/extractor/app.cpp
173	This original line seemed very very wrong. Just because we won't want to index phase 2 isn't a reason to remove the filename indexing - it'll just keep running phase 1 on itself again and again. So +1 on that

Minor optimization: change check order (filesize / extractor property)

Also, it should be better check by "Id" instead of "Name"

Harbormaster completed remote builds in B17328: Diff 67318.Oct 4 2019, 12:46 PM

poboiko marked an inline comment as done.Oct 4 2019, 12:46 PM

Ping?

In D23787#541963, @poboiko wrote:

and another example which:

is currently skipped though it should be indexed

is indexed after this change

There shouldn't be any. I mean, "PlaintextExtractor" should be inside exList for anything that starts with text/...

Then whats the "Secondly ..." paragraph in the summary about?

src/file/extractor/app.cpp
173	Off by one error - phase one is content indexing already, phase zero is filename/filestat.

In D23787#541963, @poboiko wrote:

In D23787#537891, @bruns wrote:

Can you please provide an example which:

is currently indexed though it should be skipped due to size

is skipped after this change

Sure. Any mimetype inherited from "text/plain", but starting with "text/" counts. I've made an actual list:

list.txt3 KBDownload

(using simple python script, which iterates over QMimeDatabase().allMimeTypes(), checks if type.inherits("text/plain") and is not already excluded by default Baloo config from file/fileexcludefilters.cpp)

Your script is wrong. E.g. SVG inherits from text/plain, but has its own extractor, thus is not fed to the PlaintextExtractor. Dito for anything inheriting from XML.

@bruns: I've missed D16593: [ExtractorCollection] Use only best matching extractor plugin, and had in mind previous situation where we've matched all extractors based on inheritance. In that case, "Secondly" part indeed does not seem to apply anymore.
(as for my previous answer: I misunderstood you, thought you were asking about the case where PlainTextExtractor did not match & matched afterwards)

Your script is wrong. E.g. SVG inherits from text/plain, but has its own extractor, thus is not fed to the PlaintextExtractor. Dito for anything inheriting from XML.

I'm not claiming the list to be comprehensive, it's just a first approximation.
I'm claiming just that there is plethora of plain-text-based types (and might be even more in the future), some of which in principle might cause an issue.

There were plenty of situations in the past when users first encountered Baloo choking on some files (see git log of fileexcludefilters.cpp - SQL dumps, genome data, etc.), which made Baloo unusable for them.
Luckily for us, they reported it, and we blacklisted it. But I think it's unlikely we will manage to cover all the problematic cases that way (not all users report issues, and we're not familiar with all possible mimetypes).
This patch should serve as a preventive measure, reducing the probabilty of Baloo choking on it in the first place.

poboiko edited the summary of this revision. (Show Details)Nov 13 2019, 1:11 PM

Revision Contents
Changeset List

			Path	Packages
M			src/file/extractor/app.cpp (18 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	65646	1864d58		Sep 8 2019, 12:05 PM	★	★
Diff 2	66805	a9a3bd7		Sep 25 2019, 10:40 AM	★	★
Diff 3	67318	a9a3bd7	Minor optimization: change check order (filesize / extractor property)	Oct 4 2019, 12:46 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
d0cc5cc70eab	7908a102a98f	a9a3bd732dd8	Igor Poboiko	[baloo_file_extractor] Improve handling of large plain-text files (Show More…)	Sep 8 2019, 12:02 PM

Diff 67318

View Options

src/file/extractor/app.cpp