Diffusion Baloo 7605f4d7f7c4

Store filename terms just once

Authored by bruns on Apr 17 2020, 11:18 PM.

Description

Summary:
Filename terms were stored twice, once with the "F" filename property
prefix, and once without a prefix. This makes it trivial to search for
files where a term matches in either filename or content, but has a number
of drawbacks:

  1. It is not possible to search for a term in content only
  2. The storage size for filenames is approximately doubled
  3. File renaming can cause significant I/O load
  4. Terms appearing in both content and filename may be stored incompletely in the phrase storage.

Re (2.), when full text indexing is disabled, filename terms make up a
significant part of the storage size. With full text indexing enabled,
the space savings are likely negligible.

Re (3.), when renaming a file where part of the filename is a common term,
e.g. "The fox.txt", the rename caused data to be rewritten for "the", "fox"
and "txt". While this is negligible for "txt" and "fox", "the" is common
enough to cause a rewrite of 10% of the whole DB.
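
The rename cost described in (3.) can be illustrated with a toy model. This
is only a sketch, not Baloo's actual storage code: the Index type and the
filenameTerms() and postingsTouched() helpers are hypothetical names, and the
model assumes that touching a term means rewriting its whole postings list.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy inverted index (assumption): term -> ids of the documents containing it.
using Index = std::map<std::string, std::set<int>>;

// Hypothetical helper: the terms stored for the words of a filename.
// Old scheme (storeUnprefixed = true): "F"-prefixed and unprefixed term.
// New scheme (storeUnprefixed = false): "F"-prefixed term only.
std::vector<std::string> filenameTerms(const std::vector<std::string>& words,
                                       bool storeUnprefixed)
{
    std::vector<std::string> terms;
    for (const auto& word : words) {
        terms.push_back("F" + word);
        if (storeUnprefixed) {
            terms.push_back(word);
        }
    }
    return terms;
}

// Sum of the postings list sizes for the given terms. In this model, that is
// the amount of data rewritten when a rename removes or adds these terms.
std::size_t postingsTouched(const Index& index,
                            const std::vector<std::string>& terms)
{
    std::size_t total = 0;
    for (const auto& term : terms) {
        const auto it = index.find(term);
        if (it != index.end()) {
            total += it->second.size();
        }
    }
    return total;
}
```

In this model, a common content term such as "the" dominates the cost under
the old scheme, because the unprefixed filename term shares its postings list
with the content term. With only the "F"-prefixed terms stored, a rename
touches just the much shorter filename postings lists.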

The default search behaviour of matching both filename and content
has been restored by internally creating queries for both filename and
content and ORing both together. This extra step does not have any
noticeable (or even measurable) performance impact.
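
The ORing of the two queries can be sketched as a union of two lookups. This
is a minimal illustration over a toy index, not the query code in Baloo: the
Index type and the lookup(), searchContent(), searchFilename() and
searchDefault() functions are hypothetical names chosen for this example.

```cpp
#include <map>
#include <set>
#include <string>

// Toy inverted index (assumption): term -> ids of the documents containing it.
// Filename terms carry the "F" prefix, content terms carry no prefix.
using Index = std::map<std::string, std::set<int>>;

// Return the postings for a term, or an empty set if the term is unknown.
std::set<int> lookup(const Index& index, const std::string& term)
{
    const auto it = index.find(term);
    return it != index.end() ? it->second : std::set<int>{};
}

// Corresponds to a "content:word" query in this model.
std::set<int> searchContent(const Index& index, const std::string& word)
{
    return lookup(index, word);
}

// Corresponds to a "filename:word" query in this model.
std::set<int> searchFilename(const Index& index, const std::string& word)
{
    return lookup(index, "F" + word);
}

// A bare "word" query: content match OR filename match, i.e. the union.
std::set<int> searchDefault(const Index& index, const std::string& word)
{
    std::set<int> result = searchContent(index, word);
    const std::set<int> byName = searchFilename(index, word);
    result.insert(byName.begin(), byName.end());
    return result;
}
```

Because the filename terms are no longer duplicated without the prefix,
searchContent() in this sketch returns content matches only, which addresses
drawback (1.).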

Depends on D28929

Test Plan:
$> ctest -R querytest
$> baloosearch content:pdf
$> baloosearch filename:pdf
$> baloosearch pdf
$> baloosearch content:pdf OR filename:pdf
(the last two queries are equivalent)

Reviewers: Baloo, ngraham

Reviewed By: Baloo, ngraham

Subscribers: kde-frameworks-devel

Tags: Frameworks, Baloo

Differential Revision: https://phabricator.kde.org/D28932

Details

Committed
bruns on May 4 2020, 7:48 PM
Reviewer
Baloo
Differential Revision
D28932: Store filename terms just once
Parents
R293:3eb0a513efa9: GIT_SILENT Upgrade ECM and KF5 version requirements for 5.70.0 release.