Differential D28932

Store filename terms just once
Closed, Public

Authored by bruns on Apr 17 2020, 11:22 PM.

Details

Reviewers

ngraham

Group Reviewers

Baloo

Commits

R293:7605f4d7f7c4: Store filename terms just once

Summary

Filename terms were stored twice, once with the "F" filename property
prefix and once without a prefix. This makes it trivial to search for
files where a term matches in the filename or the content, but has a
number of drawbacks:

1. It is not possible to search for a term in content only.
2. The storage size for filenames is approximately doubled.
3. File renaming can cause significant I/O load.
4. Terms appearing in both content and filename may be stored incompletely in the phrase storage.
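
The first drawback can be illustrated with a toy inverted index. This is a minimal sketch, not Baloo's actual storage: the dictionary-based index, the `index_file` and `build` helpers, and the two sample documents are all hypothetical. Only the "F" filename prefix is taken from the summary.

```python
from collections import defaultdict

FILENAME_PREFIX = "F"  # filename property prefix named in the summary


def index_file(index, doc_id, filename_terms, content_terms, store_twice):
    """Add one document to a toy term -> {doc_id} postings map (hypothetical helper)."""
    for term in filename_terms:
        index[FILENAME_PREFIX + term].add(doc_id)
        if store_twice:
            # Old behaviour: the filename term is also stored without prefix,
            # so it cannot be told apart from a content term.
            index[term].add(doc_id)
    for term in content_terms:
        index[term].add(doc_id)


def build(store_twice):
    index = defaultdict(set)
    # Document 1: "pdf" appears only in the filename.
    index_file(index, 1, ["report", "pdf"], ["quarterly", "numbers"], store_twice)
    # Document 2: "pdf" appears only in the content.
    index_file(index, 2, ["notes", "txt"], ["convert", "to", "pdf"], store_twice)
    return index


old_index = build(store_twice=True)
new_index = build(store_twice=False)

# Old layout: the unprefixed posting list mixes filename and content matches.
print(sorted(old_index["pdf"]))   # [1, 2]
# New layout: the unprefixed posting list holds content matches only.
print(sorted(new_index["pdf"]))   # [2]
print(sorted(new_index["Fpdf"]))  # [1]
```

With the doubled storage there is no posting list that answers "content only"; with single storage the unprefixed list does.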

Re (2.), if full text indexing is disabled, this is a significant
part of the storage size. With full text indexing, the space savings
are likely negligible.

Re (3.), when renaming a file where part of the filename is a common term,
e.g. "The fox.txt", renaming caused rewriting of data for "the", "fox"
and "txt". While for "txt" and "fox" this is negligible, "the" is common
enough to cause a rewrite of 10% of the whole DB.
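
The rename cost in Re (3.) can be pictured with a rough model. This is only a sketch under stated assumptions: it assumes the work for a rename is proportional to the size of every posting list the filename terms touch, and the posting-list sizes below are made-up illustration values, not measurements. `rename_cost` is a hypothetical helper.

```python
def rename_cost(postings, filename_terms, store_twice):
    """Count posting-list entries touched when a file with these name terms is renamed.

    Assumption: the cost is proportional to the total size of the touched lists.
    """
    touched = ["F" + term for term in filename_terms]
    if store_twice:
        # Old behaviour: the unprefixed (content-sized) lists are touched as well.
        touched += filename_terms
    return sum(len(postings.get(term, ())) for term in touched)


# Made-up posting-list sizes, for illustration only.
postings = {
    "the": range(100_000),  # very common in content
    "fox": range(40),
    "txt": range(500),
    "Fthe": range(30),      # filename-only lists are much smaller
    "Ffox": range(2),
    "Ftxt": range(300),
}

terms = ["the", "fox", "txt"]  # terms of "The fox.txt"
print(rename_cost(postings, terms, store_twice=True))   # 100872
print(rename_cost(postings, terms, store_twice=False))  # 332
```

In this model the common term dominates the cost only when its unprefixed list is also touched.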

The default search behaviour of matching both filename and content
has been restored by internally creating queries for both filename and
content and ORing both together. This extra step does not have any
noticeable (or even measurable) performance impact.
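
The query rewrite described above can be sketched as follows. This is an illustration only, not the code in `src/lib/searchstore.cpp`: the tuple-based query representation, `expand_term`, `evaluate`, and the sample index are hypothetical. Only the "F" prefix and the OR of a filename query and a content query come from the summary.

```python
FILENAME_PREFIX = "F"  # filename property prefix named in the summary


def expand_term(term):
    """Rewrite a bare search term into (content term OR filename term)."""
    return ("OR", ("TERM", term), ("TERM", FILENAME_PREFIX + term))


def evaluate(query, index):
    """Evaluate a tiny query tree against a term -> {doc_id} map."""
    op = query[0]
    if op == "TERM":
        return set(index.get(query[1], set()))
    if op == "OR":
        result = set()
        for sub_query in query[1:]:
            result |= evaluate(sub_query, index)
        return result
    raise ValueError("unknown query operator: %r" % (op,))


# Toy index with filename terms stored once, under the prefix.
index = {"pdf": {2}, "Fpdf": {1}}

print(sorted(evaluate(expand_term("pdf"), index)))     # [1, 2]  (bare "pdf")
print(sorted(evaluate(("TERM", "pdf"), index)))        # [2]     (content only)
print(sorted(evaluate(("TERM", "Fpdf"), index)))       # [1]     (filename only)
```

The bare term returns the union of the two restricted queries, which mirrors the equivalence stated in the test plan.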

Depends on D28929

Test Plan

$> ctest -R querytest
$> baloosearch content:pdf
$> baloosearch filename:pdf
$> baloosearch pdf
$> baloosearch content:pdf OR filename:pdf
(the last two queries are equivalent)

Diff Detail

Repository

R293 Baloo

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision. Apr 17 2020, 11:22 PM

Restricted Application added projects: Frameworks, Baloo. Apr 17 2020, 11:22 PM

Restricted Application added a subscriber: kde-frameworks-devel.

bruns requested review of this revision. Apr 17 2020, 11:22 PM

Harbormaster completed remote builds in B25402: Diff 80434. Apr 17 2020, 11:22 PM

whitespace

Harbormaster completed remote builds in B25403: Diff 80435. Apr 17 2020, 11:44 PM

add missing tests

Harbormaster completed remote builds in B25406: Diff 80439. Apr 18 2020, 1:54 AM

Ping!

I'll get around to reviewing this soon. I'm trying to figure out if I think the loss is acceptable.

In D28932#657011, @ngraham wrote:

I'll get around to reviewing this soon. I'm trying to figure out if I think the loss is acceptable.

There is no loss, there is even a gain (queries work correctly in all constellations).

bruns added a dependent revision: D29207: [Indexers] Ignore name-based mimetype for initial indexing decisions. Apr 26 2020, 3:08 PM

Ping!

bruns edited the summary of this revision. May 2 2020, 12:55 PM

This has been pending for more than two weeks now, without any sort of review ...

@ngraham If you have any questions, please ask!

Sorry for the delay. Makes sense.

This revision is now accepted and ready to land. May 4 2020, 7:03 PM

Closed by commit R293:7605f4d7f7c4: Store filename terms just once (authored by bruns). May 4 2020, 7:48 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			autotests/integration/querytest.cpp (26 lines)
M			src/engine/termgenerator.h (1 line)
M			src/engine/termgenerator.cpp (8 lines)
M			src/file/basicindexingjob.cpp (1 line)
M			src/lib/searchstore.cpp (4 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	80434	7d8fd2e		Apr 17 2020, 11:22 PM	★	★
Diff 2	80435	7d8fd2e	whitespace	Apr 17 2020, 11:44 PM	★	★
Diff 3	80439	5b78240	add missing tests	Apr 18 2020, 1:54 AM	★	★
Diff 4	81937	3eb0a51	R293:7605f4d7f7c478f251f004d1abdaaeac27d530f7	May 4 2020, 7:48 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
a9846a2d3097	29c4f8dc1876	5b78240441ac	Stefan Brüns	Store filename terms just once	Apr 17 2020, 11:18 PM

Status	Author	Revision
Closed	bruns	D29207 [Indexers] Ignore name-based mimetype for initial indexing decisions
Closed	bruns	D28932 Store filename terms just once
Closed	bruns	D28929 [QueryTest] Track if phrase matches in content or filename
Closed	bruns	D28925 [QueryTest] Extend phrase query tests