Projects plugin: fix git file listing for umlauts such as äöü
Closed, Public

Authored by dhaumann on Feb 4 2018, 9:55 PM.

Details

Summary

By default, git ls-files does not print umlauts or other non-ASCII characters literally.
The problem is that, for a path containing umlauts, git ls-files outputs:

$ git ls-files | grep Der
"Der B\303\244cker/L\303\266ffler.txt"
DerBaecker/Loeffler.txt

instead of "Der Bäcker/Löffler.txt".
It wraps the path in quotes and replaces each byte of the UTF-8 encoded
ä and ö with an octal escape sequence.
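
To make the escaping concrete, here is a minimal sketch (not part of this patch) that turns such a quoted line back into raw UTF-8 bytes. The helper name unquoteGitPath is hypothetical, and the sketch only handles \NNN octal escapes plus \" and \\; other escapes such as \n or \t are deliberately left out.

```cpp
#include <string>

// Hypothetical helper, not part of the patch: decode one line as printed by
// plain `git ls-files`. Only \NNN octal escapes, \" and \\ are handled.
std::string unquoteGitPath(const std::string &line)
{
    // Lines without surrounding quotes are already literal paths.
    if (line.size() < 2 || line.front() != '"' || line.back() != '"')
        return line;

    auto isOctal = [](char c) { return c >= '0' && c <= '7'; };

    std::string out;
    const std::size_t end = line.size() - 1; // index of the closing quote
    for (std::size_t i = 1; i < end; ++i) {
        if (line[i] != '\\' || i + 1 >= end) {
            out += line[i];
            continue;
        }
        if (i + 3 < end && isOctal(line[i + 1]) && isOctal(line[i + 2])
            && isOctal(line[i + 3])) {
            // Three octal digits encode one raw byte, e.g. \303 is 0xC3.
            const int value = (line[i + 1] - '0') * 64 + (line[i + 2] - '0') * 8
                + (line[i + 3] - '0');
            out += static_cast<char>(static_cast<unsigned char>(value));
            i += 3;
        } else {
            // Escaped quote or backslash: keep the character after the backslash.
            out += line[i + 1];
            ++i;
        }
    }
    return out;
}
```

Here \303\244 decodes to the bytes 0xC3 0xA4, which is the UTF-8 encoding of ä. The -z approach makes this kind of decoding unnecessary.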

This patch uses git ls-files -z for listing the contents. Instead
of one entry per line, the listing is then dumped as a byte array
in which the entries are separated by \0.

In the -z mode, no quoting or escaping is done, and umlauts such
as äöü or any other Unicode characters come through as plain UTF-8
and are displayed correctly.
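
The splitting step can be sketched with the standard library alone. This is an illustration under assumptions, not the code of this patch: splitNulSeparated is a hypothetical name, and in the plugin the equivalent would presumably be done on the QByteArray (for example with QByteArray::split('\0')) followed by a conversion to QString.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split the raw output of `git ls-files -z` into entries.
// Each entry stays a raw byte string; no unescaping is needed in -z mode.
std::vector<std::string> splitNulSeparated(const std::string &raw)
{
    std::vector<std::string> entries;
    std::size_t start = 0;
    while (start < raw.size()) {
        std::size_t end = raw.find('\0', start);
        if (end == std::string::npos)
            end = raw.size(); // tolerate a missing final terminator
        if (end > start)
            entries.emplace_back(raw, start, end - start); // skip empty entries
        start = end + 1;
    }
    return entries;
}
```

Because \0 cannot occur inside a file name, this also handles file names that contain a newline, which a line-based split would cut in two.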

There is still room for improvement, since readAllStandardOutput()
might return a very large listing, which allocates a lot of memory.
Therefore, a buffered solution (using a lambda or so) would probably
be better. This, however, can be done in a separate patch.
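
One possible shape for such a buffered solution, as a sketch only: the class NulEntryReader and its callback are hypothetical, and wiring it up to the process (for instance by feeding it from QProcess's readyReadStandardOutput signal) is assumed rather than shown.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical incremental parser: output is fed in chunks, and only the
// unfinished tail of the last entry is kept in memory between chunks.
class NulEntryReader
{
public:
    explicit NulEntryReader(std::function<void(const std::string &)> onEntry)
        : m_onEntry(std::move(onEntry))
    {
    }

    // Feed the next chunk; every completed entry is reported immediately.
    void feed(const char *data, std::size_t size)
    {
        m_pending.append(data, size);
        std::size_t start = 0;
        for (;;) {
            const std::size_t end = m_pending.find('\0', start);
            if (end == std::string::npos)
                break;
            if (end > start)
                m_onEntry(m_pending.substr(start, end - start));
            start = end + 1;
        }
        m_pending.erase(0, start); // drop everything that was reported
    }

    // Call once the process has finished to flush an unterminated last entry.
    void finish()
    {
        if (!m_pending.empty())
            m_onEntry(m_pending);
        m_pending.clear();
    }

private:
    std::function<void(const std::string &)> m_onEntry;
    std::string m_pending;
};
```

An entry that is split across two chunks is reported only once its terminating \0 has arrived, so the callback never sees partial file names.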

BUG: 389415

Test Plan

make test

Diff Detail

Repository
R40 Kate
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
dhaumann created this revision. Feb 4 2018, 9:55 PM
Restricted Application added a project: Kate. Feb 4 2018, 9:55 PM
dhaumann requested review of this revision. Feb 4 2018, 9:55 PM
dhaumann updated this revision to Diff 26541.
  • Use reference
dhaumann added a subscriber: Kate. Feb 5 2018, 5:51 PM

Hmm, is the encoding of the file names then the local 8 bit variant? Or is it utf-8 everywhere?
On Windows I am always a bit confused about when to use local8Bit.

To be honest, I do not know. The risk of introducing a regression definitely exists. Also, I do not know whether this works on Windows. What do you suggest to test / do?

I just tried it on Windows, it's UTF-8.
If it is UTF-8 even there, we should just convert from that instead.

dhaumann updated this revision to Diff 27960. Feb 24 2018, 9:41 PM
  • Use QString::fromUtf8(ByteArray) for raw string interpretation
brauch accepted this revision. Feb 24 2018, 11:56 PM
brauch added a subscriber: brauch.

Looks good to me, also because null termination sounds better than \n (filenames can easily contain \n although they usually don't).

I think, though, that old Windows systems (think Windows XP) don't use UTF-8 for filename encoding, do they?

This revision is now accepted and ready to land. Feb 24 2018, 11:56 PM

Could be, but Windows XP is not supported anymore, so we do not care.

I looked it up, it depends on the file system. NTFS always uses UTF8, but FAT uses some weird 1980's charset. So I think this breaks if you open files from FAT file systems.

Good point. The question is what git does in this case. We should be able to test this with a FAT-formatted USB stick... Todo :)

Windows never uses UTF-8 ;=) Even NTFS uses UTF16/UCS2, but not UTF-8.
But git always uses utf-8 for its stuff.

cullmann accepted this revision. Feb 25 2018, 11:32 AM

Btw., if you want to see the joys of utf-8 on Windows, read: http://utf8everywhere.org/

;=)

In any case, for our use, we don't need to care for the file system encoding, git encodes in utf-8.
We later use the Qt API, that will call "the right" functions with "the right" encoding.

This revision was automatically updated to reflect the committed changes.