[TermGenerator] Use UTF-8 ByteArray for termList
ClosedPublic

Authored by bruns on Jun 16 2019, 12:29 AM.

Details

Summary

All users of TermGenerator::termList do not use QStrings but UTF-8
ByteArrays. Remove repetitive code.

While at it, make the lists const to avoid detach on iteration.

Test Plan

ctest

Diff Detail

Repository
R293 Baloo
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bruns created this revision.Jun 16 2019, 12:29 AM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptJun 16 2019, 12:29 AM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
bruns requested review of this revision.Jun 16 2019, 12:29 AM
ngraham accepted this revision.Jun 16 2019, 5:12 PM
This revision is now accepted and ready to land.Jun 16 2019, 5:12 PM
This revision was automatically updated to reflect the committed changes.
poboiko added a comment.EditedJun 16 2019, 5:13 PM

Actually, there is an issue with that code right now, which I wanted to fix, but forgot.
The trimming part finalArr = finalArr.mid(0, maxTermSize); actually should be performed on QString instead of QByteArray - unicode symbols inside term can consist of two bytes, and cutting at maxTermSize bytes can actually cut half of last symbol. I end up with terms like тождественно� inside balooshow -x.
Not to mention that russian terms end up being pretty small.

bruns added a comment.Jun 16 2019, 5:19 PM

Actually, there is an issue with that code right now, which I wanted to fix, but forgot.
The trimming part finalArr = finalArr.mid(0, maxTermSize); actually should be performed on QString instead of QByteArray - unicode symbols inside term can consist of two bytes, and cutting at maxTermSize bytes can actually cut half of last symbol. I end up with terms like тождественно� inside balooshow -x.
Not to mention that russian terms end up being pretty small.

As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

Yep, that's what I've suggested (if I understood you correctly).
I guess, we can put the trimming part right to termList(), it will further reduce code repetition. Something like

str = str.left(maxTermSize); 
list << str.toUtf8();

As the limit is somewhat arbitrary, maybe we can just limit the QString? I don't think this has any serious side effects.

Yep, that's what I've suggested (if I understood you correctly).
I guess, we can put the trimming part right to termList(), it will further reduce code repetition. Something like

str = str.left(maxTermSize); 
list << str.toUtf8();

See D21865