[baloo_file_extractor] Improve handling of large plain-text files
Needs Review · Public

Authored by poboiko on Sun, Sep 8, 12:05 PM.

Details

Reviewers
bruns
ngraham
Group Reviewers
Baloo
Summary

First of all, not all plain-text-based mimetypes start with text/:
e.g. application/sql for SQL dumps (already handled in FileExcludeFilters),
or application/postscript for PS images. There are most likely more.
An alternative solution would be to use QMimeType::inherits instead.
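
To illustrate why a text/ prefix check misses such types while an inheritance check would not, here is a minimal sketch in plain C++. It does not use Qt: the parent table, the helper names (MimeParents, inheritsFrom, startsWithText, sampleParents) and the sample entries are assumptions made for illustration, standing in for the lookup that QMimeType::inherits performs against the real mime database.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical parent table: mimetype -> direct parent mimetypes.
// A real implementation would query the mime database instead.
using MimeParents = std::unordered_map<std::string, std::vector<std::string>>;

// Prefix check: true only if the mimetype name begins with "text/".
bool startsWithText(const std::string& mime)
{
    return mime.rfind("text/", 0) == 0;
}

// True if `mime` equals `ancestor` or inherits from it, directly or transitively.
bool inheritsFrom(const MimeParents& parents, const std::string& mime,
                  const std::string& ancestor, int depth = 0)
{
    if (mime == ancestor) {
        return true;
    }
    if (depth > 16) {
        return false; // guard against cycles in a malformed table
    }
    const auto it = parents.find(mime);
    if (it == parents.end()) {
        return false;
    }
    for (const std::string& parent : it->second) {
        if (inheritsFrom(parents, parent, ancestor, depth + 1)) {
            return true;
        }
    }
    return false;
}

// Illustrative sample data only (assumed, not read from a real database).
MimeParents sampleParents()
{
    return {
        {"application/sql", {"text/plain"}},
        {"application/postscript", {"text/plain"}},
        {"text/x-csrc", {"text/plain"}},
    };
}
```

With this sample table, application/sql fails the prefix check but passes the inheritance check against text/plain.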

Secondly, not all extractors are bad with large files: for example, if it is
a PS image, then PostScriptDSExtractor might still extract useful information.
Issues are mostly caused by PlainTextExtractor, which generates far too many
terms.

This patch tackles both issues: for large files it skips only PlainTextExtractor,
using the extractor metadata introduced in D19109: [Extractor] Add metadata to extractors.
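
The selection logic described above can be sketched as follows. This is a standalone illustration, not the code from the patch: ExtractorInfo, its skipForLargeFiles flag, kMaxPlainTextSize and extractorsToRun are hypothetical names, and the 10 MiB limit is assumed from the test plan.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for per-extractor metadata.
struct ExtractorInfo {
    std::string name;
    bool skipForLargeFiles; // assumed flag, true for the plain-text extractor
};

// Assumed size limit (10 MiB), taken from the test plan.
constexpr std::int64_t kMaxPlainTextSize = 10 * 1024 * 1024;

// Returns the names of the extractors that should run on a file of the given
// size. Only extractors flagged as problematic are dropped for large files;
// all others still run.
std::vector<std::string> extractorsToRun(const std::vector<ExtractorInfo>& candidates,
                                         std::int64_t fileSize)
{
    const bool isLarge = fileSize > kMaxPlainTextSize;
    std::vector<std::string> selected;
    for (const ExtractorInfo& extractor : candidates) {
        if (isLarge && extractor.skipForLargeFiles) {
            continue; // skip only the flagged extractor
        }
        selected.push_back(extractor.name);
    }
    return selected;
}
```

Under these assumptions, a large PostScript file would still be handed to PostScriptDSExtractor, while PlainTextExtractor is left out.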

Test Plan
  1. Create a large .txt file (>10 MB)
  2. Verify that baloo_file_extractor still skips it.

Diff Detail

Repository
R293 Baloo
Branch
improve-large-text-files (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16251
Build 16269: arc lint + arc unit
poboiko created this revision. Sun, Sep 8, 12:05 PM
Restricted Application added projects: Frameworks, Baloo. Sun, Sep 8, 12:05 PM
poboiko requested review of this revision. Sun, Sep 8, 12:05 PM
broulik added inline comments.
src/file/extractor/app.cpp
183

Store the size in a variable outside the loop, otherwise you end up querying it on each iteration.
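
A minimal sketch of the suggested change, in plain C++ rather than the actual app.cpp code: the size is queried once before the loop and the result is reused for every extractor. The function runExtractors, its parameters and the querySize callback are hypothetical and exist only to make the query count observable.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Counts how many extractors would run. `skipWhenLarge` holds one assumed flag
// per extractor; `querySize` stands in for the file size lookup.
std::size_t runExtractors(const std::vector<bool>& skipWhenLarge,
                          const std::function<std::int64_t()>& querySize,
                          std::int64_t limit)
{
    // Query the size once, outside the loop, instead of on every iteration.
    const std::int64_t size = querySize();
    const bool isLarge = size > limit;

    std::size_t ran = 0;
    for (const bool skip : skipWhenLarge) {
        if (isLarge && skip) {
            continue;
        }
        ++ran;
    }
    return ran;
}
```

However many extractors are iterated, the size lookup happens exactly once per file.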