Differential D15068

Bindings: Correct handling of sources containing utf-8
ClosedPublic
Actions

Authored by bruns on Aug 25 2018, 2:36 AM.

Details

Reviewers

lbeltrame

Group Reviewers

Frameworks

Commits

R240:28da9e614f66: Bindings: Correct handling of sources containing utf-8

Summary

Depending on the locale, python3 may try to decode the source as ASCII
when the file is opened in text mode. This will fail as soon as the
code contains utf-8, e.g. (c) symbols.

While it is possible to specify the encoding when reading the file,
this is bad for several reasons:

only a very small part of the source is processed via _read_source, no need to decode the complete source and store it as string objects
the clang Cursor.extent.{start,end}.column refers to bytes, not multibyte characters.

While python2 processes utf-8 containing sources without error messages,
wrong extent borders are also an issue.

The practical impact is low, as the issue only manifests if there is a
multibyte character in front of *and* on the same line as the read token.

Test Plan

Python3: Build any bindings which contains sources with non-ASCII codepoints,
e.g. kcoreaddons. Unpatched version fails when using e.g. LANG=C.
Python2: Both versions generate sources successfully.

Bytes vs characters test:

#define Q_SLOTS
class foo {
/* a */ public Q_SLOTS:
/* ä */ public Q_SLOTS:
};

sip_generator.py --flags "" /usr/lib64/libclang.so Qt5Ruleset.py test.h out.sip
Obviously, both lines should result in the same code, the unfixed version generates public Q_SLOTS: vs public:.

Diff Detail

Repository

R240 Extra CMake Modules

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision.Aug 25 2018, 2:36 AM

Restricted Application added projects: Frameworks, Build System. · View Herald TranscriptAug 25 2018, 2:36 AM

Restricted Application added subscribers: kde-buildsystem, kde-frameworks-devel. · View Herald Transcript

Harbormaster completed remote builds in B2214: Diff 40392.Aug 25 2018, 2:36 AM

bruns requested review of this revision.Aug 25 2018, 2:36 AM

bruns edited the test plan for this revision. (Show Details)Aug 25 2018, 3:24 AM

bruns mentioned this in D14912: Bindings: Make generator forward compatible with Python 3.Aug 25 2018, 2:46 PM

Can someone review this? It is currently not possible to build the bindings with python3, e.g. for kcoreaddons.

Another solution (not tested here but used in other projects) could be to use
with open(source, "r", encoding="utf-8") as f:
(or if the file could contain the aberration BOM, you could use "utf-8-sig")

And it should convert automatically all kind of line-ends and still return text.
I'm assuming this is for python 3.5 or later.

The code itself looks fine to me - the change on line 752 is to allow handling of macOS / Windows line endings in addition to just Unix line endings I presume?

LGTM.

find-modules/sip_generator.py
753	Perhaps mention in the comment that you're doing this for all the types of newlines?

This revision is now accepted and ready to land.Sep 14 2018, 12:21 PM

In D15068#325659, @jtamate wrote:

Another solution (not tested here but used in other projects) could be to use
with open(source, "r", encoding="utf-8") as f:
(or if the file could contain the aberration BOM, you could use "utf-8-sig")

Please read the full summary again why reading the source as utf-8 is a bad ide - hint: clang expresses ranges in bytes, not codepoints.

In D15068#325752, @bcooksley wrote:

The code itself looks fine to me - the change on line 752 is to allow handling of macOS / Windows line endings in addition to just Unix line endings I presume?

Yes. Thats what python does by default when reading files in text mode (converting all newline variants to Unix newlines).

Closed by commit R240:28da9e614f66: Bindings: Correct handling of sources containing utf-8 (authored by bruns). · Explain WhySep 14 2018, 12:45 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			find-modules/sip_generator.py (6 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	40392	ae3b712		Aug 25 2018, 2:36 AM	★	★
Diff 2	41646	5ecea2f	R240:28da9e614f663210f4e5e0790b56055ce517bf38	Sep 14 2018, 12:45 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
3131a0faeae8	199e459e62f0	ae3b712fa105	Stefan Brüns	Bindings: Correct handling of sources containing utf-8 (Show More…)	Aug 19 2018, 12:32 AM

Diff 41646

View Options

find-modules/sip_generator.py