Add a po file for the list of words in GCompris
ClosedPublic

Authored by jjazeix on Dec 22 2018, 5:25 PM.

Details

Reviewers
huftis
Group Reviewers
GCompris
Localization
Summary

The aim of this diff would be to add a po containing all the words to be translated in the lang activity (https://gcompris.net/incoming/lang/words.json) to ease the work for the translators.

Multiple questions:

  1. Is it useful for translators to have it in a po format (in KDE translation repository) instead of a json file (which is only in GCompris repository)?
  2. Which information would be needed to have in the translator comments? The images given in comment should help to understand the context, not sure if more is needed.
  3. I'm really not used to awk, maybe there is some cleaner way to write what I did if someone is interested :).
Test Plan

The first step is to fill the PO files with the words from the locales that are already available (those in https://cgit.kde.org/gcompris.git/tree/src/activities/lang/resource), so that existing work is not lost and translators do not have to redo it.

Then, once the po files are ok, when there are updates, synchronize the po files with the json files (so translators would only need to update the po files, no more updating manually the json files).

jjazeix created this revision.Dec 22 2018, 5:25 PM
Restricted Application added a project: KDE Edu.Dec 22 2018, 5:25 PM
Restricted Application added a subscriber: kde-edu.
jjazeix requested review of this revision.Dec 22 2018, 5:25 PM

Will be committed next week if there are no complaints. The scripts to convert from JSON to PO and back are done.

aacid added a subscriber: aacid.Jan 10 2019, 10:27 PM

I think you should be using StaticMessages.sh

This allows you to get the script that merges back to your files to be run by scripty every day so you don't forget it.

We don't have much documentation but maybe reading the kdeconnect file is enough to understand how to do it?

https://cgit.kde.org/kdeconnect-android.git/tree/StaticMessages.sh

huftis requested changes to this revision.Jan 12 2019, 2:27 PM
huftis added a subscriber: huftis.

Yes, it is very useful to have it in the PO format instead of the JSON format.

The information seems sufficient. But could you split it over two lines instead, and add a space after the colon? I recommend:

#. Description: "alphabet"
#. Image: https://www.gcompris.net/incoming/lang/lang/words/alphabet.png

(Lokalize currently displays the two lines as a single-line string, but that’s a bug (and it’s not a big problem here): https://bugs.kde.org/show_bug.cgi?id=403142)

It’s not clear what the strings will be used for. Will the text be used even if there isn’t an audio file in the given language?

And if so, shouldn’t it be taken from content-en.json (if not empty) instead of from the file name (e.g. "seven" instead of U0037)? And if the English string in ‘content-en.json’ is missing, at least the underscore in the filename should be converted to a space (e.g. ‘air_horn’ → ‘air horn’).
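The fallback described above can be sketched in a few lines of Python. This is only an illustration: the function name `english_label` is hypothetical, and it assumes that content-en.json is a flat mapping from audio file name to English description.

```python
import os


def english_label(filename, english_content):
    """Return the English text to use as msgid for one audio-file key.

    `english_content` is assumed to be the parsed content-en.json,
    i.e. a dict mapping file names to English descriptions.
    """
    # Prefer the description from the English dataset when it is non-empty.
    label = english_content.get(filename, "")
    if label:
        return label
    # Otherwise fall back to the file name: drop the extension and
    # turn underscores into spaces ("air_horn.ogg" -> "air horn").
    stem, _ext = os.path.splitext(filename)
    return stem.replace("_", " ")


# Hypothetical sample data, for illustration only.
content_en = {"U0037.ogg": "seven", "air_horn.ogg": ""}
print(english_label("U0037.ogg", content_en))
print(english_label("air_horn.ogg", content_en))
```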

This revision now requires changes to proceed.Jan 12 2019, 2:27 PM

The information seems sufficient. But could you split it over two lines instead, and add a space after the colon? I recommend:

#. Description: "alphabet"
#. Image: https://www.gcompris.net/incoming/lang/lang/words/alphabet.png

I can change it to:
#. otherChapter / number / 10.ogg
#. https://gcompris.net/incoming/lang/words.html#ten
#: https://gcompris.net/incoming/lang/words.html#ten
msgid "ten"
msgstr ""

This way, the text to translate is really the description, not the sound file?

It’s not clear what the strings will be used for. Will the text be used even if there isn’t an audio file in the given language?

Yes, the text will be used even if there is no audio. If a word is not translated, the English word is not used as a fallback: the word is simply left out of the category being learned. For example, if "red" is not translated in the colors category, we will display all the words except that one.
The plan is to use StaticMessages.sh to write the translations back into content-$lang.json. There is an internal API that maps the audio file back to its description, so we never display the .ogg file name.

And if so, shouldn’t it be taken from content-en.json (if not empty) instead of from the file name (e.g. "seven" instead of U0037)? And if the English string in ‘content-en.json’ is missing, at least the underscore in the filename should be converted to a space (e.g. ‘air_horn’ → ‘air horn’).

jjazeix updated this revision to Diff 49555.Jan 15 2019, 6:44 PM

Generates po with lines looking like:
#. otherChapter / number / U0039.ogg
#. https://gcompris.net/incoming/lang/words.html#nine
#: https://gcompris.net/incoming/lang/words.html#nine
msgid "nine"
msgstr ""

#. otherChapter / action / scratch.ogg
#. https://gcompris.net/incoming/lang/words.html#scratch
#: https://gcompris.net/incoming/lang/words.html#scratch
msgid "to scratch"
msgstr ""

Pushing it in a few hours if there are no complaints :). Note that there is a script to pre-fill the PO files for languages that already have translations in the JSON files.

huftis added a comment.EditedJan 19 2019, 12:14 PM

The link in the comment section of your example doesn’t work (and why is it duplicated, BTW?):

https://gcompris.net/incoming/lang/words.html#scratch

It looks like the web site uses the description ‘to scratch’, not the ID ‘scratch’ in the links.

The following link seems to work:

https://gcompris.net/incoming/lang/words.html#to%20scratch

(Though it really shouldn’t, since spaces aren’t allowed in ‘id’ or ‘name’ attributes in HTML.)

huftis requested changes to this revision.Jan 19 2019, 12:27 PM

In StaticMessages.sh, the call to the Python script in import_po_files is commented out. Is this intentional?

src/StaticMessages.sh
22

The actual Python call is commented out. Is this intentional?

This revision now requires changes to proceed.Jan 19 2019, 12:27 PM

Also, since (AFAICS) the words are organized by section in the PO file (that’s good), perhaps you should link to

https://gcompris.net/incoming/lang/words_by_section.html
instead of to
https://gcompris.net/incoming/lang/words.html

Then the images on the Web page would be in the same order as in the PO file, which makes things easier for the translators.

A few suggested changes in the POT header.

src/activities/lang/resource/datasetToPo.py
69
72

This is typically set to
Last-Translator: FULL NAME <EMAIL@ADDRESS>\n
in POT files.

73

This is typically set to:
Language-Team: LANGUAGE <kde-i18n-doc@kde.org>\n
in KDE POT files.

jjazeix updated this revision to Diff 49919.EditedJan 20 2019, 9:59 AM

Remove the duplicated comment containing the image link. Fix the image link when the description contains spaces (we'll look into removing the spaces later).
Fix the header.
Use words_by_section.html page instead of words.html

src/StaticMessages.sh
22

Yes, I first want to fill the existing PO files from the JSON files before enabling it (and to make sure it saves the files correctly, in the right place).

jjazeix updated this revision to Diff 49922.Jan 20 2019, 10:08 AM
jjazeix marked 3 inline comments as done.

missing \n

huftis requested changes to this revision.Jan 20 2019, 2:06 PM

One final, minor change in the URLs is needed to make them clickable.

src/activities/lang/resource/datasetToPo.py
84

Any spaces in the URL need to be URL-encoded for the URL to be clickable. Basically, just replace all spaces with %20.

This revision now requires changes to proceed.Jan 20 2019, 2:06 PM
jjazeix updated this revision to Diff 49933.Jan 20 2019, 2:13 PM

Replace " " with "%20" in urls

pino added a subscriber: pino.Jan 20 2019, 2:13 PM
pino added inline comments.
src/activities/lang/resource/datasetToPo.py
84

Not only spaces, but any special character. Please use the urllib module to do the escaping properly.

jjazeix updated this revision to Diff 49934.Jan 20 2019, 2:15 PM
jjazeix marked an inline comment as done.

Replace "%20" with " " when recreating the json file

pino added inline comments.Jan 20 2019, 2:22 PM
src/activities/lang/resource/datasetToPo.py
84

Please use the urllib module to do the escaping properly, instead of a manual replace.
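A minimal sketch of what this could look like with the standard library's `urllib.parse`. The helper names `image_link` and `description_from_link` are hypothetical and are not taken from datasetToPo.py. The base URL is the words_by_section.html page mentioned earlier in this review.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://gcompris.net/incoming/lang/words_by_section.html"


def image_link(description):
    """Build a clickable link whose fragment is the percent-encoded description."""
    # quote() escapes spaces and any other character that is not URL-safe,
    # including non-ASCII letters (encoded as UTF-8).
    return BASE_URL + "#" + quote(description)


def description_from_link(link):
    """Recover the original description from a link built by image_link()."""
    fragment = link.split("#", 1)[1]
    return unquote(fragment)


print(image_link("to scratch"))
print(description_from_link(image_link("crème brûlée")))
```

Compared with a manual `replace(" ", "%20")`, this also covers accented letters and reserved characters, and `unquote()` reverses it without a hand-written inverse.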

huftis requested changes to this revision.Jan 20 2019, 2:26 PM

In poToDataset.py, only translated strings should be included. Currently, if a translator translates ‘foo’ to ‘bar’, waits until the JSON file is regenerated, changes their mind and deletes or fuzzies the translation (“I don’t think ‘bar’ is the correct translation for ‘foo’ after all, but I’m not sure what is the correct translation yet”), the JSON file is stuck with ‘bar’ as the translation.

src/activities/lang/resource/poToDataset.py
50

This doesn’t delete old, untranslated/fuzzy entries. It should.

This revision now requires changes to proceed.Jan 20 2019, 2:26 PM
jjazeix updated this revision to Diff 49936.Jan 20 2019, 2:38 PM
jjazeix marked 3 inline comments as done.

Use urllib for both encoding and decoding.
Remove translation if fuzzy or update it if updated.

Since the filename, e.g. alarmclock.ogg, is used as the key in the JSON file, I think it would be cleaner to use it as a ‘msgctxt’ in the PO file. That way, you don’t have to try to parse the comments to extract the keys when regenerating the JSON files. And it makes it possible to have more than one image with the same ‘msgid’ (homographs with different meanings, e.g. a verb and a noun). (I don’t think there are currently any such strings, but there may be in the future.)

src/activities/lang/resource/datasetToPo.py
88

Consider using the ‘msgctxt’ field for storing the JSON keys.

src/activities/lang/resource/poToDataset.py
43

Consider using the ‘msgctxt’ field for storing the JSON keys.

Since the filename, e.g. alarmclock.ogg, is used as the key in the JSON file, I think it would be cleaner to use it as a ‘msgctxt’ in the PO file. That way, you don’t have to try to parse the comments to extract the keys when regenerating the JSON files. And it makes it possible to have more than one image with the same ‘msgid’ (homographs with different meanings, e.g. a verb and a noun). (I don’t think there are currently any such strings, but there may be in the future.)

I thought about it at first, but I was afraid that translators would keep the .ogg extension in the translation. If it's safe, it will be easier. I think "orange" exists both as a color and as a fruit, but orange-color.ogg is used for the color.

huftis requested changes to this revision.Jan 20 2019, 2:49 PM

The poToDataset.py script only seems to work if 1) there already *is* a JSON file and 2) the file contains an entry for the strings in the PO file. So someone needs to manually add the JSON file and keep the entries updated to reflect the original English JSON file. Wouldn’t it be easier to just write the JSON files based on the PO file? They should contain all the information needed to generate JSON files.

This revision now requires changes to proceed.Jan 20 2019, 2:49 PM
huftis added a comment.EditedJan 20 2019, 2:53 PM

Since the filename, e.g. alarmclock.ogg, is used as the key in the JSON file, I think it would be cleaner to use it as a ‘msgctxt’ in the PO file. That way, you don’t have to try to parse the comments to extract the keys when regenerating the JSON files. And it makes it possible to have more than one image with the same ‘msgid’ (homographs with different meanings, e.g. a verb and a noun). (I don’t think there are currently any such strings, but there may be in the future.)

I thought about it at first, but I was afraid that translators would keep the .ogg extension in the translation. If it's safe, it will be easier. I think "orange" exists both as a color and as a fruit, but orange-color.ogg is used for the color.

If you put the ID in the msgctxt field, there won’t be a problem. It will look something like this:

msgctxt "orange-fruit.ogg"
msgid "orange"
msgstr "appelsin"

msgctxt "orange-colour.ogg"
msgid "orange"
msgstr "oransje"

In the PO editors, the translators will see the ‘msgctxt’ in a different pane than the one containing the original strings (msgids). There is no risk that they will translate the filename.
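To illustrate, entries of this shape could be produced roughly as follows. This is a standalone sketch, not the code from datasetToPo.py: the helpers `po_escape` and `po_entry` are hypothetical, and the sample keys are the ones from the example above.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def po_entry(key, english, translation=""):
    """Format one PO entry that keeps the JSON key in msgctxt."""
    return (
        f'msgctxt "{po_escape(key)}"\n'
        f'msgid "{po_escape(english)}"\n'
        f'msgstr "{po_escape(translation)}"\n'
    )


# Two homographs: same msgid, different msgctxt, so both entries can coexist.
entries = [
    ("orange-fruit.ogg", "orange", "appelsin"),
    ("orange-colour.ogg", "orange", "oransje"),
]
print("\n".join(po_entry(*entry) for entry in entries))
```

Because the key lives in its own field, regenerating the JSON file only needs the (msgctxt, msgstr) pairs and never has to parse the translator comments.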

jjazeix updated this revision to Diff 49939.Jan 20 2019, 3:01 PM
jjazeix marked 2 inline comments as done.

Use msgctxt to store the key of json file.
Write a new JSON file instead of starting from an existing one.

huftis added a comment.EditedJan 20 2019, 3:12 PM

I have tested the scripts and found one bug in poToDataset.py. It also converts obsolete entries in the PO file into JSON entries.

But I have a question. Is it necessary to also output the empty (untranslated/fuzzy) entries? If not, you can fix the bug and simplify the JSON files at the same time by just using:

for entry in poFile.translated_entries():
    word = entry.msgctxt
    data[word] = entry.msgstr
jjazeix updated this revision to Diff 49941.Jan 20 2019, 3:21 PM

Only set the translated values in the json file
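The effect of that filter can be shown without polib. The sketch below uses a hypothetical `Entry` dataclass as a stand-in for a parsed PO entry and keeps an entry only if it has a non-empty msgstr and is neither fuzzy nor obsolete, which is the behaviour the suggestion above relies on from `translated_entries()`. The sample data is invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one parsed PO entry."""
    msgctxt: str
    msgid: str
    msgstr: str = ""
    fuzzy: bool = False
    obsolete: bool = False


def translated_only(entries):
    """Map msgctxt -> msgstr, skipping untranslated, fuzzy and obsolete entries."""
    return {
        entry.msgctxt: entry.msgstr
        for entry in entries
        if entry.msgstr and not entry.fuzzy and not entry.obsolete
    }


entries = [
    Entry("orange-fruit.ogg", "orange", "appelsin"),
    Entry("orange-colour.ogg", "orange", "oransje", fuzzy=True),  # translator unsure
    Entry("alarmclock.ogg", "alarm clock"),                       # not translated yet
    Entry("old-word.ogg", "old word", "gammalt ord", obsolete=True),
]

# The JSON file is written from scratch, so a withdrawn translation cannot linger.
print(json.dumps(translated_only(entries), ensure_ascii=False, indent=4, sort_keys=True))
```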

huftis accepted this revision.EditedJan 20 2019, 3:28 PM

I don’t know how StaticMessages.sh stuff works, so I’m not qualified to test that part. But I’ve tested the POT and JSON generator scripts, and they seem to work perfectly.

Thanks for your work on this! It makes the translators’ work much easier.

This revision is now accepted and ready to land.Jan 20 2019, 3:28 PM

I have tested the scripts and found one bug in poToDataset.py. It also converts obsolete entries in the PO file into JSON entries.

But I have a question. Is it necessary to also output the empty (untranslated/fuzzy) entries? If not, you can fix the bug and simplify the JSON files at the same time by just using:

for entry in poFile.translated_entries():
    word = entry.msgctxt
    data[word] = entry.msgstr

It will work (I tried by removing all numbers of the json file excepting 2) and

I don’t know how StaticMessages.sh stuff works, so I’m not qualified to test that part. But I’ve tested the POT and JSON generator scripts, and they seem to work perfectly.

Thanks for your work on this! It makes the translators’ work much easier.

Thanks a lot for all the remarks! (Albert and Pino too :)). I'm pushing it, and I'll check the logs tomorrow and merge the existing translations. If all goes well, I'll uncomment the import.