Write valid UTF8 characters without escaping.
ClosedPublic

Authored by vandenoever on Feb 17 2019, 10:40 PM.

Details

Summary

commit 6a18528 introduced escaping of bytes >= 127 to ensure that
KConfig files are valid UTF8.
The simplistic approach with a cutoff results in many escaped bytes
where it is not required. Especially non-western configuration files
would have many escapes.

This commit fixes that by only escaping bytes that are not valid UTF8.

BUG: 403557
FIXED-IN: 5.56

Test Plan

ninja && ninja test

Diff Detail

Repository
R237 KConfig
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
vandenoever created this revision.Feb 17 2019, 10:40 PM
Restricted Application added a project: Frameworks. · View Herald TranscriptFeb 17 2019, 10:40 PM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
vandenoever requested review of this revision.Feb 17 2019, 10:40 PM

The commit fixes (as expected) the config write issue observed in 403557.

ngraham edited the summary of this revision. (Show Details)Feb 18 2019, 5:15 PM
ngraham added a reviewer: Frameworks.

No objection from me, but I'm no utf-8 expert.

Thiago's input would be very valuable...

autotests/kconfigtest.cpp
1774

What's the purpose of this very short-lived define, compared to just inlining this into the next line?

  • Remove VALUE define.
  • Spelling fix.
vandenoever added inline comments.Feb 18 2019, 9:50 PM
autotests/kconfigtest.cpp
1774

Right, copy-paste leftover. I'll fix it.

thiago added inline comments.Feb 19 2019, 3:35 PM
src/core/kconfigini.cpp
683

This function does operate properly to find valid syntax UTF-8 sequences, but it is neither catching overlong sequences nor UTF-8 content above U+10FFFF (UTF-8 can encode 0x11000 in 4 bytes).

See https://code.woboq.org/qt5/qtbase/tests/auto/corelib/codecs/utf8/utf8data.cpp.html#_Z19loadInvalidUtf8Rowsv for potential UTF-8 pitfalls.

vandenoever added inline comments.Feb 19 2019, 4:23 PM
src/core/kconfigini.cpp
683

A check for U+10FFFF > value is needed.
Overlong sequences are caught on line 696 (count < 4).

thiago added inline comments.Feb 19 2019, 4:48 PM
src/core/kconfigini.cpp
683

That's not what an overlong sequence is. You can produce 2-, 3- and 4-byte overlong sequences. See the examples in the Qt test, like 0xc0 0x80.

vandenoever added inline comments.Feb 19 2019, 4:59 PM
src/core/kconfigini.cpp
683

My understanding of overlong sequences was wrong. I understand now what is meant from your link.

Added tests and code fixes to deal with overlong sequences and content
above U+10FFFF.

Remove duplicate if statement.

Clean up test by using QTest data.

vandenoever added inline comments.Feb 19 2019, 6:51 PM
src/core/kconfigini.cpp
683

Thanks for the unicode explanation. I've added the checks for out of range and overlong now.

Much better, thanks.

So it's accepted?

It looks good to me.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 20 2019, 4:19 PM
This revision was automatically updated to reflect the committed changes.