Store filename terms just once
Closed · Public

Authored by bruns on Apr 17 2020, 11:22 PM.

Details

Summary

Filename terms were stored twice, once with the "F" filename property
prefix, and once without a prefix. This makes it trivial to search for
files where a term matches in filename or content, but has a number
of drawbacks:

  1. It is not possible to search for a term in content only
  2. The storage size for filenames is approximately doubled
  3. File renaming can cause significant I/O load
  4. Terms appearing in both content and filename may be stored incompletely in the phrase storage
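
Drawback (1.) can be illustrated with a toy inverted index. This is only a sketch, not Baloo's actual storage code: `build_index`, the `store_filename_unprefixed` flag and the sample documents are made up for illustration, and only the "F" prefix is taken from the summary above.

```python
from collections import defaultdict

FILENAME_PREFIX = "F"  # the "F" filename property prefix mentioned in the summary


def build_index(docs, store_filename_unprefixed):
    """Build a toy term -> set(doc id) index (hypothetical helper).

    docs maps a doc id to (filename terms, content terms). Terms are assumed
    to be lowercase, so a prefixed key cannot collide with a content term.
    """
    index = defaultdict(set)
    for doc_id, (name_terms, content_terms) in docs.items():
        for term in content_terms:
            index[term].add(doc_id)
        for term in name_terms:
            index[FILENAME_PREFIX + term].add(doc_id)
            if store_filename_unprefixed:
                # Old behaviour: the filename term is stored a second time.
                index[term].add(doc_id)
    return index


docs = {
    1: (["report", "pdf"], ["quarterly", "figures"]),  # "pdf" only in the filename
    2: (["notes", "txt"], ["convert", "to", "pdf"]),   # "pdf" only in the content
}

old = build_index(docs, store_filename_unprefixed=True)
new = build_index(docs, store_filename_unprefixed=False)

# Old scheme: the unprefixed term also matches file 1, so a content-only
# lookup is not possible. New scheme: the two lookups are cleanly separated.
content_hits_old = old["pdf"]
content_hits_new = new["pdf"]
filename_hits_new = new[FILENAME_PREFIX + "pdf"]
```

In this toy model the old index cannot tell whether file 1 matched by name or by content, while the new index answers both questions separately.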

Re (2.): when full text indexing is disabled, filenames account for a
significant part of the storage size. With full text indexing enabled,
the space savings are likely negligible.

Re (3.): when renaming a file whose name contains a common term,
e.g. "The fox.txt", the rename caused data for "the", "fox" and "txt"
to be rewritten. While this is negligible for "txt" and "fox", "the" is
common enough to cause a rewrite of 10% of the whole DB.
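
A rough cost model for the rename case, as a sketch only. It assumes that every term of the old and new filename has its whole posting list rewritten, which simplifies whatever the database really does. The function `rename_cost` and all posting list sizes below are invented for illustration and are not measurements.

```python
def rename_cost(posting_sizes, old_name_terms, new_name_terms, store_unprefixed):
    """Count posting entries rewritten by a rename (hypothetical model).

    Assumption: each term of the old and the new filename is touched, and
    touching a term rewrites its complete posting list.
    """
    terms = set(old_name_terms) | set(new_name_terms)
    touched = {"F" + term for term in terms}
    if store_unprefixed:
        # Old behaviour: the unprefixed lists, shared with content terms,
        # are touched as well.
        touched |= terms
    return sum(posting_sizes.get(key, 0) for key in touched)


# Made-up posting list sizes: unprefixed lists include content matches,
# "F"-prefixed lists only contain files with the term in their name.
posting_sizes = {
    "the": 100_000, "fox": 12, "dog": 30, "txt": 3_000,
    "Fthe": 40, "Ffox": 3, "Fdog": 5, "Ftxt": 900,
}

old_name = ["the", "fox", "txt"]  # "The fox.txt"
new_name = ["the", "dog", "txt"]  # "The dog.txt" (example target name)

cost_before = rename_cost(posting_sizes, old_name, new_name, store_unprefixed=True)
cost_after = rename_cost(posting_sizes, old_name, new_name, store_unprefixed=False)
```

With these invented numbers the cost is dominated by the unprefixed "the" list. Once filename terms are stored only with the prefix, that list is no longer touched by a rename.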

The default search behaviour of matching both filename and content
has been restored by internally creating queries for both filename and
content and ORing both together. This extra step does not have any
noticeable (or even measurable) performance impact.
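
The query expansion described above could look roughly like the following sketch. `Term`, `Or`, `expand` and `evaluate` are hypothetical names for illustration and do not mirror Baloo's query classes. The toy index assumes filename terms are stored only under the "F" prefix.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Term:
    value: str
    prop: Optional[str] = None  # "filename", "content", or None if unqualified


@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"


Query = Union[Term, Or]


def expand(query):
    """Rewrite unqualified terms into (content OR filename)."""
    if isinstance(query, Or):
        return Or(expand(query.left), expand(query.right))
    if query.prop is None:
        return Or(Term(query.value, "content"), Term(query.value, "filename"))
    return query


def evaluate(query, index):
    """Evaluate an already expanded query against a term -> doc ids mapping."""
    if isinstance(query, Or):
        return evaluate(query.left, index) | evaluate(query.right, index)
    key = "F" + query.value if query.prop == "filename" else query.value
    return index.get(key, set())


# Toy index: file 1 has "pdf" in its name, file 2 has "pdf" in its content.
index = {"pdf": {2}, "Fpdf": {1}}

plain = evaluate(expand(Term("pdf")), index)
explicit = evaluate(Or(Term("pdf", "content"), Term("pdf", "filename")), index)
```

Under these assumptions `plain` and `explicit` are the same result set, which corresponds to the equivalence of `baloosearch pdf` and `baloosearch content:pdf OR filename:pdf`.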

Depends on D28929

Test Plan

$> ctest -R querytest
$> baloosearch content:pdf
$> baloosearch filename:pdf
$> baloosearch pdf
$> baloosearch content:pdf OR filename:pdf
(the last two queries are equivalent)

Diff Detail

Repository
R293 Baloo
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bruns created this revision. Apr 17 2020, 11:22 PM
Restricted Application added projects: Frameworks, Baloo. Apr 17 2020, 11:22 PM
Restricted Application added a subscriber: kde-frameworks-devel.
bruns requested review of this revision. Apr 17 2020, 11:22 PM
bruns updated this revision to Diff 80435. Apr 17 2020, 11:44 PM

whitespace

bruns updated this revision to Diff 80439. Apr 18 2020, 1:54 AM

add missing tests

I'll get around to reviewing this soon. I'm trying to figure out if I think the loss is acceptable.

bruns added a comment. Apr 25 2020, 3:05 PM

I'll get around to reviewing this soon. I'm trying to figure out if I think the loss is acceptable.

There is no loss, there is even a gain (queries work correctly in all constellations).

bruns edited the summary of this revision. May 2 2020, 12:55 PM
bruns added a comment. May 4 2020, 4:28 PM

This has been pending for more than two weeks now, without any sort of review ...

@ngraham If you have any questions, please ask!

ngraham accepted this revision. May 4 2020, 7:03 PM

Sorry for the delay. Makes sense.

This revision is now accepted and ready to land. May 4 2020, 7:03 PM
This revision was automatically updated to reflect the committed changes.