[TermGenerator] Do Term truncation prior to UTF-8 conversion
Closed · Public

Authored by bruns on Jun 16 2019, 10:42 PM.

Details

Reviewers

ngraham
astippich
poboiko

Group Reviewers

Baloo

Commits

R293:a99bb98143c6: [TermGenerator] Do Term truncation prior to UTF-8 conversion

Summary

The (somewhat arbitrary) term truncation was applied to the UTF-8 encoded
data, sometimes truncating the term in the middle of a codepoint.

Truncate the QString instead. This also has the effect of leaving more
useful characters for languages where the majority of codepoints are
encoded as 2 or more bytes.

This requires some extra storage in the DB when a term which would
have been truncated previously now goes in as is, but likely only a few
terms / languages are affected (for English words, UTF-8 encodes most
codepoints in 1 byte).
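
The difference between the two orderings can be illustrated with a minimal, Qt-free sketch. It is not the code from termgenerator.cpp: the function names are hypothetical, the length limit is passed in as a parameter rather than taken from Baloo, and std::u32string (one element per codepoint) stands in for QString, which stores UTF-16 code units.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode a sequence of Unicode codepoints as UTF-8 bytes.
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        // Only valid scalar values are handled in this sketch.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Old order (hypothetical name): encode first, then cut the byte string.
std::string truncateAfterEncoding(const std::u32string& term, std::size_t maxLen)
{
    return toUtf8(term).substr(0, maxLen);
}

// New order (hypothetical name): cut the character string, then encode.
std::string truncateBeforeEncoding(const std::u32string& term, std::size_t maxLen)
{
    return toUtf8(term.substr(0, maxLen));
}

// Returns true if the byte string ends inside a multi-byte sequence.
// Assumes the input is otherwise well-formed UTF-8.
bool endsMidCodepoint(const std::string& bytes)
{
    std::size_t i = bytes.size();
    std::size_t trailing = 0;
    // Step back over continuation bytes (bit pattern 10xxxxxx).
    while (i > 0 && (static_cast<unsigned char>(bytes[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++trailing;
    }
    if (i == 0) {
        return trailing > 0; // continuation bytes without a lead byte
    }
    const unsigned char lead = static_cast<unsigned char>(bytes[i - 1]);
    // Number of continuation bytes the lead byte announces.
    const std::size_t expected = lead < 0x80 ? 0
                               : (lead >> 5) == 0x6 ? 1
                               : (lead >> 4) == 0xE ? 2
                               : 3;
    return trailing < expected;
}
```

With a term of three "ä" characters (2 bytes each in UTF-8) and a limit of 3, cutting after encoding keeps 3 bytes and ends in the middle of the second character, while cutting before encoding keeps all three characters at a cost of 6 bytes. For plain ASCII terms both orderings give the same result.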

There is a small caveat for the SearchStore. As queries were truncated
likewise, an untruncated query would no longer find untruncated terms from
new index runs. To allow matches nevertheless, truncated terms use
StartsWith instead of Equal matches.
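
The query-side rule described above could look roughly like the following sketch. It is an illustration under assumptions, not the code from searchstore.cpp: MatchType, TermQuery, makeTermQuery and matches are hypothetical names, and the limit is a parameter rather than Baloo's actual value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the Equal and StartsWith match kinds.
enum class MatchType { Equal, StartsWith };

struct TermQuery {
    std::u32string term;
    MatchType type;
};

// Build a term query: a term longer than maxLen is truncated and
// matched as a prefix; a term that fits is matched exactly.
TermQuery makeTermQuery(const std::u32string& term, std::size_t maxLen)
{
    assert(maxLen > 0);
    if (term.size() > maxLen) {
        return {term.substr(0, maxLen), MatchType::StartsWith};
    }
    return {term, MatchType::Equal};
}

// Check a single indexed term against the query.
bool matches(const TermQuery& query, const std::u32string& indexedTerm)
{
    if (query.type == MatchType::StartsWith) {
        return indexedTerm.compare(0, query.term.size(), query.term) == 0;
    }
    return indexedTerm == query.term;
}
```

In this sketch a truncated query still finds an indexed term that was stored at a longer length, because only the shared prefix is compared, whereas a query that fits within the limit keeps exact-match semantics.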

Test Plan

ctest

Diff Detail

Repository

R293 Baloo

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision. Jun 16 2019, 10:42 PM

Restricted Application added projects: Frameworks, Baloo. Jun 16 2019, 10:42 PM

Restricted Application added a subscriber: kde-frameworks-devel.

bruns requested review of this revision. Jun 16 2019, 10:42 PM

Harbormaster completed remote builds in B12921: Diff 59964. Jun 16 2019, 10:42 PM

bruns mentioned this in D21839: [TermGenerator] Use UTF-8 ByteArray for termList. Jun 16 2019, 10:42 PM

ngraham accepted this revision. Jun 16 2019, 11:50 PM

This revision is now accepted and ready to land. Jun 16 2019, 11:50 PM

Closed by commit R293:a99bb98143c6: [TermGenerator] Do Term truncation prior to UTF-8 conversion (authored by bruns). Jun 17 2019, 12:01 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

Path
M	src/engine/termgenerator.cpp (6 lines)
M	src/lib/searchstore.cpp (8 lines)

Diff	ID	Base	Description	Created
Diff 1	59964	c723243		Jun 16 2019, 10:42 PM
Diff 2	59968	4651988	R293:a99bb98143c6a0d06ea75eeecd93e44ccc3c8d6d	Jun 17 2019, 12:01 AM
