[TermGenerator] Use UTF-8 ByteArray for termList
ClosedPublic
Actions

Authored by bruns on Jun 16 2019, 12:29 AM.

Details

Reviewers

ngraham
astippich
poboiko

Group Reviewers

Baloo

Commits

R293:7420bc056326: [TermGenerator] Use UTF-8 ByteArray for termList

Summary

All users of TermGenerator::termList do not use QStrings but UTF-8
ByteArrays. Remove repetitive code.

While at it, make the lists const to avoid detach on iteration.

Test Plan

ctest

Diff Detail

Repository

R293 Baloo

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision.Jun 16 2019, 12:29 AM

Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptJun 16 2019, 12:29 AM

Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript

Harbormaster completed remote builds in B12875: Diff 59896.Jun 16 2019, 12:29 AM

bruns requested review of this revision.Jun 16 2019, 12:29 AM

ngraham accepted this revision.Jun 16 2019, 5:12 PM

This revision is now accepted and ready to land.Jun 16 2019, 5:12 PM

Closed by commit R293:7420bc056326: [TermGenerator] Use UTF-8 ByteArray for termList (authored by bruns). · Explain WhyJun 16 2019, 5:13 PM

This revision was automatically updated to reflect the committed changes.

Actually, there is an issue with that code right now, which I wanted to fix, but forgot.
The trimming part finalArr = finalArr.mid(0, maxTermSize); actually should be performed on QString instead of QByteArray - unicode symbols inside term can consist of two bytes, and cutting at maxTermSize bytes can actually cut half of last symbol. I end up with terms like тождественно� inside balooshow -x.
Not to mention that russian terms end up being pretty small.

In D21839#480765, @poboiko wrote:

Actually, there is an issue with that code right now, which I wanted to fix, but forgot.
The trimming part finalArr = finalArr.mid(0, maxTermSize); actually should be performed on QString instead of QByteArray - unicode symbols inside term can consist of two bytes, and cutting at maxTermSize bytes can actually cut half of last symbol. I end up with terms like тождественно� inside balooshow -x.
Not to mention that russian terms end up being pretty small.

As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

Yep, that's what I've suggested (if I understood you correctly).
I guess, we can put the trimming part right to termList(), it will further reduce code repetition. Something like

str = str.left(maxTermSize); 
list << str.toUtf8();

In D21839#480777, @poboiko wrote:
As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

Yep, that's what I've suggested (if I understood you correctly).
I guess, we can put the trimming part right to termList(), it will further reduce code repetition. Something like
str = str.left(maxTermSize); 
list << str.toUtf8();

See D21865