Diffusion Baloo a99bb98143c6

[TermGenerator] Do Term truncation prior to UTF-8 conversion

Authored by bruns on Jun 16 2019, 10:40 PM.

Description

Summary:
The (somewhat arbitrary) term truncation was applied to the UTF-8 encoded
data, sometimes truncating the term in the middle of a codepoint.

Truncate the QString instead. This also has the effect of leaving more
useful characters for languages where the majority of codepoints are
encoded as 2 or more bytes.
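
A minimal sketch of the difference between the two orderings, not the actual
TermGenerator code. It uses std::u32string as a stand-in for QString (which
holds UTF-16 code units, not full codepoints), and the limit of 3 as well as
all function names are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode a sequence of Unicode code points as UTF-8 (no validation; sketch only).
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Old order: encode first, then cut the byte string at maxLen bytes.
std::string truncateAfterEncoding(const std::u32string& term, std::size_t maxLen)
{
    return toUtf8(term).substr(0, maxLen);
}

// New order: cut the code point sequence at maxLen code points, then encode.
std::string truncateBeforeEncoding(const std::u32string& term, std::size_t maxLen)
{
    return toUtf8(term.substr(0, maxLen));
}

// True if the byte string ends with an incomplete (or malformed) UTF-8 sequence.
bool endsMidCodepoint(const std::string& utf8)
{
    std::size_t i = utf8.size();
    std::size_t continuation = 0;
    // Walk back over trailing continuation bytes (bit pattern 10xxxxxx).
    while (i > 0 && (static_cast<unsigned char>(utf8[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++continuation;
    }
    if (i == 0) {
        return continuation > 0;
    }
    const unsigned char lead = static_cast<unsigned char>(utf8[i - 1]);
    const std::size_t expected = lead < 0x80        ? 0
                                 : (lead >> 5) == 0x6 ? 1
                                 : (lead >> 4) == 0xE ? 2
                                                      : 3;
    return continuation != expected;
}

// Usage: four times U+00E4, each encoded as 2 bytes in UTF-8.
void demoTruncation()
{
    const std::u32string term = U"\u00e4\u00e4\u00e4\u00e4";
    assert(endsMidCodepoint(truncateAfterEncoding(term, 3)));    // cut inside the 2nd codepoint
    assert(!endsMidCodepoint(truncateBeforeEncoding(term, 3)));  // three whole codepoints
    assert(truncateBeforeEncoding(term, 3).size() == 6);         // more bytes end up stored
}
```

With the same limit, the second ordering keeps three whole characters (6
bytes) where the first kept one and a half (3 bytes), which is also where the
extra storage comes from.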

This requires some extra storage size in the DB when a term which would
have been truncated previously now goes in as is, but likely only a few
terms / languages are affected (for English words UTF-8 encodes most
codepoints in 1 byte).

There is a small caveat for the SearchStore. As queries were truncated
likewise, an untruncated query would no longer find untruncated terms from
new index runs. To allow matches nevertheless, truncated terms use
StartsWith instead of Equal matches.
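
The effect of switching the comparison can be sketched as follows. This is an
illustration only, not the SearchStore implementation: the types, the helper
names, and the limit of 5 are assumptions made for the example, and plain
ASCII std::string is used to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Match { Equal, StartsWith };

struct QueryTerm {
    std::string text;
    Match mode;
};

// Hypothetical helper: a query word longer than maxLen is cut and switched
// to a prefix match; shorter words keep the exact match.
QueryTerm prepareQueryTerm(const std::string& word, std::size_t maxLen)
{
    if (word.size() > maxLen) {
        return {word.substr(0, maxLen), Match::StartsWith};
    }
    return {word, Match::Equal};
}

// Compare a prepared query term against one term stored in the index.
bool matches(const QueryTerm& query, const std::string& indexed)
{
    if (query.mode == Match::Equal) {
        return indexed == query.text;
    }
    // Prefix comparison; an indexed term shorter than the query never matches.
    return indexed.compare(0, query.text.size(), query.text) == 0;
}

// Usage: the truncated query still finds a term stored with more characters.
void demoMatching()
{
    const QueryTerm query = prepareQueryTerm("abcdefgh", 5);
    assert(query.text == "abcde" && query.mode == Match::StartsWith);
    assert(matches(query, "abcde"));    // term stored at the same length
    assert(matches(query, "abcdefg"));  // term stored with more characters
    assert(!matches(QueryTerm{"abcde", Match::Equal}, "abcdefg"));  // Equal would miss it
}
```

The trade-off in this sketch is that a prefix match can also return terms that
merely share the first characters with the query.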

Test Plan: ctest

Reviewers: Baloo, ngraham, astippich, poboiko

Reviewed By: Baloo, ngraham

Subscribers: kde-frameworks-devel

Tags: Frameworks, Baloo

Differential Revision: https://phabricator.kde.org/D21865

Details

Committed
bruns on Jun 17 2019, 12:01 AM
Reviewer
Baloo
Differential Revision
D21865: [TermGenerator] Do Term truncation prior to UTF-8 conversion
Parents
R293:4651988e2963: [IdUtils] Fix aliasing warning