Improvements for Gettext entries wordwrapping
Needs ReviewPublic

Authored by azamathackimov on Jan 21 2018, 5:58 PM.

Details

Summary

Improves word wrapping code for Gettext catalog.
https://bugs.kde.org/show_bug.cgi?id=380007

Old algorithm works not right as expected - if text have unintended newlines, lokalize wraps them as required, and result goes not identical as vanilla gettext.
This produce some noise when committing PO catalogs into git.

  • Fix issue when whitespace appears as leading character in next string instead of trailing in previous.
  • New algorithm removes all newlines and then add them where they are really needed, after that goes standard word wrapping with nice max 78-wide lines.

Diff Detail

Repository
R456 Lokalize
Lint
Lint Skipped
Unit
Unit Tests Skipped
azamathackimov requested review of this revision.Jan 21 2018, 5:58 PM
azamathackimov created this revision.
lueck added a subscriber: lueck.

adding Lokalize maintainer

meskobalazs accepted this revision.Apr 18 2018, 12:01 PM
meskobalazs added a subscriber: meskobalazs.

I have tested this patch when it was submitted, it solves the annoying word wrapping problem. Could we merge this?

This revision is now accepted and ready to land.Apr 18 2018, 12:01 PM
aacid requested changes to this revision.May 20 2018, 9:45 PM
aacid added a reviewer: KDE Applications.
aacid added a subscriber: aacid.

I'm going to add a Needs changes here to remind myself to review this, given that meskobalazs is not a kde/lokalize developer i don't think it was a good idea of him marking this as accepted.

This revision now requires changes to proceed.May 20 2018, 9:45 PM

I'm going to add a Needs changes here to remind myself to review this, given that meskobalazs is not a kde/lokalize developer i don't think it was a good idea of him marking this as accepted.

You are right, of course. I didn't mark it on a whim though, I did give it three months for someone to accept it, but there was no reaction until now. By the way, I authored a very similar patch before I found that somebody already did this.
Anyway, thanks for reviewing it. Finally fixing this issue will enable our localization community to use Lokalize again.

aacid added a comment.Jun 19 2018, 9:46 PM

This is a change that would greatly benefit from an auto test, sadly we don't have to have any in lokalize.

Can you please explain the code changes?

Can you please explain the code changes?

I can't give you a line-by-line explanation, as I am not the author. However, it restores the old way as Lokalize exported the PO files - which matches other PO-based tools. Namely, in multiline msgstr strings, the spaces are put at the end of the lines, instead of the beginning of the next one.

Simon what do you think about this?

sdepiets updated this revision to Diff 39238.Aug 7 2018, 5:12 AM

Fixing patch

Simon what do you think about this?

I've just "rebased" the patch because I fixed an issue with the wordwrap already in the same file.
I need more context to understand what exactly is the issue with the current behavior to begin with. Do you have specific examples ?

With this patch this is the behavior that I see (it removes all line breaks in source and target), which seems rather unwanted in that situation.

-msgid ""
-"<center>This investment account does not contain the \"%1\" security.</center>"
-"<center>Transactions involving this security will be ignored.</center>"
-msgstr ""
-"<center>Ce compte de placement ne contient pas le titre « %1 ».</center>"
-"<center>Les opérations impliquant ce titre seront ignorées.</center>"
+msgid "<center>This investment account does not contain the \"%1\" security.</center><center>Transactions involving this security will be ignored.</center>"
+msgstr "<center>Ce compte de placement ne contient pas le titre « %1 ».</center> <center>Les opérations impliquant ce titre seront ignorées.</center>"
sdepiets requested changes to this revision.Aug 7 2018, 5:20 AM
This revision now requires changes to proceed.Aug 7 2018, 5:20 AM

It has been more than a year. Should we close this and https://bugs.kde.org/show_bug.cgi?id=380007?

We never got a proper explanation of what this does or how it works.

The person that submitted it never answered any question.

I'd say abandon this.

May be the original author is not following Phabricator. You can try and ping him in the bugs.kde.org report or by email.

Hello, it's me again.

Previous work was not ideal, so I rewrote whole thing.

aacid added inline comments.Dec 11 2019, 9:52 PM
src/catalog/gettext/gettextexport.h
77

wouldn't it make more sense if the signature was
static QStringList tokenize(const QString &text, const QRegExp &delimiters);
?

azamathackimov marked an inline comment as done.Dec 14 2019, 6:12 PM
azamathackimov added inline comments.
src/catalog/gettext/gettextexport.h
77

Well, I thought that will be non-optimal. Refactored to

static QStringList tokenize(const QString &text, const QRegExp &delimiters);
azamathackimov marked 2 inline comments as done.Dec 17 2019, 8:52 PM

Can you share a .po file where this makes a difference?

And also can you provide a better description of what it does as a commit message? Because "Improvements for Gettext entries wordwrapping" is as generic as it gets.

azamathackimov edited the summary of this revision. (Show Details)

updated diff

Can you share a .po file where this makes a difference?

And also can you provide a better description of what it does as a commit message? Because "Improvements for Gettext entries wordwrapping" is as generic as it gets.

Here example from one of my translation projects. I have properly formatted by msguniq dable.dlg.po - o dable.dlg.po file
First commit with unpatched lokalize:
https://github.com/winterheart/planescape/commit/72ae2d06b479e3e5ed8ea5e36ab5458806d094e2

Notice that I didn't made any meaningful changes to message, I only pressed Ctrl-i on active translation (remove newlines from translation) and saved file. Expected result is no changes on entry, but lokalize wrapping message incorrectly. Notice that how word wrapping adds leading spaces on each newline in message.

Second commit with patched lokalize right after unpatched commit:
https://github.com/winterheart/planescape/commit/47003b033239a260f4a286f6406594f2d5ae8477

Now lokalize wraps message correctly.

Why this is big deal? When translation being updated, usually there msgmerge command are involved that by default word wrap messages with gettext's wrap() function (https://github.com/autotools-mirror/gettext/blob/master/gettext-tools/src/write-po.c#L616). If PO-file is not word-wrapped correctly, updated file will have lots noisy and pointless changes related to wrapping, and worse - this goes to git history. This make difficult to developer and translator to see real difference. Other scenario - when two translators uses different tools to translate - say lokalize and poedit. poedit uses wrap directly from msguniq (calls it on save), and again - lots of noise and pointless changes.

gettext's wrap() function is bit complicated and has lots of legacy code and portable inlines, but key of algorithm is using libunistring's tokenizing by finding possible line breakers. Since in Qt 5 this can be done a lots easier, so I implemented tokenizer based on QTextBoundaryFinder with QTextBoundaryFinder::Line boundary detector.

When you say "incorrectly" you mean "not like i like it" or "not like gettext creates", right?

The file is just as valid, isn't it?

It may be valid, but the "standard version" produced for example by msgfmt is the one where the spaces are at the end of the line, not at the beginning.
It is worth noting that KBabel first and Lokalize later, up to a certain version which I don't remember, did produce the "properly" formatted version.

IIRC scripty reformats the messages too (as it uses msgfmt), so IMHO using the same format would be useful.