Add CLDR-based country to ISO code mapping
ClosedPublic

Authored by vkrause on Aug 2 2018, 5:07 PM.

Details

Summary

This replaces the many year outdated countrytransl.map file with a
compiled-in data table. This is much more complete (28k strings rather
than the previous ~5k), up-to-date (the old data set was from a time
when Yugoslavia was still a thing), and much much faster to access
(no more text file parsing for each lookup, but a binary search in
shared read-only data). However, we pay for this by needing more disk
space (670kB vs 95kB), which is largely due to the increased amount of
strings, many of which are in "expensive" encodings, the overhead per
string only increases from four to five bytes.

This is based on Sune's work in kde:scratch/sune/countrieslistcldr.

Diff Detail

Repository
R174 KContacts
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
vkrause created this revision.Aug 2 2018, 5:07 PM
Restricted Application added a project: KDE PIM. · View Herald TranscriptAug 2 2018, 5:07 PM
Restricted Application added a subscriber: kde-pim. · View Herald Transcript
vkrause requested review of this revision.Aug 2 2018, 5:07 PM
winterz added a subscriber: winterz.Aug 2 2018, 5:33 PM
winterz added inline comments.
src/generator/CMakeLists.txt
2

I don't have this path on my system.
I suppose I'll need to find it and perhaps install the necessary package.

we need CMake code that handles the missing CLDR data files more nicely

vkrause added inline comments.Aug 2 2018, 5:52 PM
src/generator/CMakeLists.txt
2

Note that this isn't needed for normal building, the generated files are checked in. You'd only need this when updating the data table from newer CLDR data. Rebuilding the data tables every time seems overkill, especially since the CLDR dataset is ~180MB of XML files, and new countries aren't added that often.

svuorela added inline comments.
src/generator/CMakeLists.txt
2

On debian, they are located in /usr/share/unicode/cldr/common/main/

src/generator/translatedcountrylist.h
30

iirc, I also did some basic unit testing of these functions. Maybe we should integrate them as well ?

no objections. looks ok to me.

src/generator/CMakeLists.txt
2

ack.

svuorela added inline comments.Aug 2 2018, 7:39 PM
src/generator/main.cpp
67

Maybe also write the (local) link to the generator code ?

vkrause added inline comments.Aug 2 2018, 7:52 PM
src/generator/translatedcountrylist.h
30

Yep, that's where some of the extra tests came from. I didn't include the tests of the intermediate parsing stage, as they would not add any extra safety, so the extra work needed to enable that seems not worth the effort.

vkrause updated this revision to Diff 38979.Aug 2 2018, 8:13 PM

Address review comments, and remove ambigious strings.

There's 4 of them:

Removing ambigious string: "nigeri" "ne" "ng"
Removing ambigious string: "конго" "cd" "cg"
Removing ambigious string: "свети мартин" "mf" "sx"
Removing ambigious string: "لتونی" "lt" "lv"

vkrause updated this revision to Diff 39016.Aug 3 2018, 2:40 PM

Be less strict when matching country names, a unique prefix is also good enough.

This makes the kitinerary test suite happy again, as we now also can handle
"United States of America" rather than just "United States", for example.

mlaurent requested changes to this revision.Aug 3 2018, 3:32 PM
mlaurent added a subscriber: mlaurent.
mlaurent added inline comments.
src/generator/main.cpp
35

new line before {

src/generator/translatedcountrylist.cpp
33

not necessary I think

62

coding style space before "&" or "*" everywhere in file.

This revision now requires changes to proceed.Aug 3 2018, 3:32 PM
vkrause updated this revision to Diff 39028.Aug 3 2018, 4:29 PM

Coding style fixes.

krake accepted this revision.Aug 13 2018, 2:17 PM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 24 2018, 3:30 PM
This revision was automatically updated to reflect the committed changes.