Bindings: Correct handling of sources containing utf-8
ClosedPublic

Authored by bruns on Aug 25 2018, 2:36 AM.

Details

Summary

Depending on the locale, python3 may try to decode the source as ASCII
when the file is opened in text mode. This will fail as soon as the
code contains utf-8, e.g. (c) symbols.

While it is possible to specify the encoding when reading the file,
this is bad for several reasons:

  • only a very small part of the source is processed via _read_source, no need to decode the complete source and store it as string objects
  • the clang Cursor.extent.{start,end}.column refers to bytes, not multibyte characters.

While python2 processes utf-8 containing sources without error messages,
wrong extent borders are also an issue.

The practical impact is low, as the issue only manifests if there is a
multibyte character in front of *and* on the same line as the read token.

Test Plan

Python3: Build any bindings which contains sources with non-ASCII codepoints,
e.g. kcoreaddons. Unpatched version fails when using e.g. LANG=C.
Python2: Both versions generate sources successfully.

Bytes vs characters test:

#define Q_SLOTS
class foo {
/* a */ public Q_SLOTS:
/* ä */ public Q_SLOTS:
};

sip_generator.py --flags "" /usr/lib64/libclang.so Qt5Ruleset.py test.h out.sip
Obviously, both lines should result in the same code, the unfixed version generates public Q_SLOTS: vs public:.

Diff Detail

Repository
R240 Extra CMake Modules
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bruns created this revision.Aug 25 2018, 2:36 AM
Restricted Application added projects: Frameworks, Build System. · View Herald TranscriptAug 25 2018, 2:36 AM
Restricted Application added subscribers: kde-buildsystem, kde-frameworks-devel. · View Herald Transcript
bruns requested review of this revision.Aug 25 2018, 2:36 AM
bruns edited the test plan for this revision. (Show Details)Aug 25 2018, 3:24 AM
bruns added a comment.Sep 13 2018, 8:40 PM

Can someone review this? It is currently not possible to build the bindings with python3, e.g. for kcoreaddons.

Another solution (not tested here but used in other projects) could be to use
with open(source, "r", encoding="utf-8") as f:
(or if the file could contain the aberration BOM, you could use "utf-8-sig")

And it should convert automatically all kind of line-ends and still return text.
I'm assuming this is for python 3.5 or later.

The code itself looks fine to me - the change on line 752 is to allow handling of macOS / Windows line endings in addition to just Unix line endings I presume?

lbeltrame accepted this revision.Sep 14 2018, 12:21 PM
lbeltrame added a subscriber: lbeltrame.

LGTM.

find-modules/sip_generator.py
753

Perhaps mention in the comment that you're doing this for all the types of newlines?

This revision is now accepted and ready to land.Sep 14 2018, 12:21 PM

Another solution (not tested here but used in other projects) could be to use
with open(source, "r", encoding="utf-8") as f:
(or if the file could contain the aberration BOM, you could use "utf-8-sig")

Please read the full summary again why reading the source as utf-8 is a bad ide - hint: clang expresses ranges in bytes, not codepoints.

The code itself looks fine to me - the change on line 752 is to allow handling of macOS / Windows line endings in addition to just Unix line endings I presume?

Yes. Thats what python does by default when reading files in text mode (converting all newline variants to Unix newlines).

This revision was automatically updated to reflect the committed changes.