Fix name grouping feature for cyrillic names
ClosedPublic

Authored by AndreyYashkin on Jul 7 2019, 8:54 AM.

Details

Summary

All files and folders with cyrillic names are placed in latin 'A' group. This patch fixes this issue.

BUG: 406867

Test Plan

Make grouping by name in the dir with cyrillic files or dirs in it.
See screenshots below.
Before


After

Diff Detail

Repository
R318 Dolphin
Lint
Lint Skipped
Unit
Unit Tests Skipped
AndreyYashkin created this revision.Jul 7 2019, 8:54 AM
Restricted Application added a subscriber: kfm-devel. · View Herald TranscriptJul 7 2019, 8:54 AM
AndreyYashkin requested review of this revision.Jul 7 2019, 8:54 AM
AndreyYashkin edited the summary of this revision. (Show Details)Jul 7 2019, 9:20 AM
AndreyYashkin edited the test plan for this revision. (Show Details)
ngraham edited the summary of this revision. (Show Details)Jul 7 2019, 11:55 AM
ngraham accepted this revision.Jul 7 2019, 12:02 PM
ngraham added subscribers: elvisangelaccio, ngraham.

Thanks, very nice first patch! This looks like the correct fix to me, and doesn't break grouping with the Latin alphabet or any of the unit tests. And you even used arc to submit the patch, so I don't have to ask for your email address. All in all, solid work!

@elvisangelaccio?

This revision is now accepted and ready to land.Jul 7 2019, 12:02 PM

Thanks, very nice first patch! This looks like the correct fix to me, and doesn't break grouping with the Latin alphabet or any of the unit tests. And you even used arc to submit the patch, so I don't have to ask for your email address. All in all, solid work!

@elvisangelaccio?

Hi. I cannot land my patch. I use https://community.kde.org/Infrastructure/Phabricator tutorial, but something is going wrong and I don't understand what...

andrey@METALMACHINE:~/.../dolphin$ arc land --preview
Landing current branch 'fix_for_bug_406867'.
 TARGET  Landing onto "master", selected by following tracking branches upstream to the closest remote.
 REMOTE  Using remote "origin", selected by following tracking branches upstream to the closest remote.
 FETCH  Fetching origin/master...
This commit will be landed:

      - 80454cbf1 Fix name grouping feature for cyrillic names

 PREVIEW  Completed preview of operation.


andrey@METALMACHINE:~/.../dolphin$ arc land
Landing current branch 'fix_for_bug_406867'.
 TARGET  Landing onto "master", selected by following tracking branches upstream to the closest remote.
 REMOTE  Using remote "origin", selected by following tracking branches upstream to the closest remote.
 FETCH  Fetching origin/master...
This commit will be landed:

      - 80454cbf1 Fix name grouping feature for cyrillic names

Landing revision 'D22303: Fix name grouping feature for cyrillic names'...
 BUILDS PASSED  Harbormaster builds for the active diff completed successfully.
 PUSHING  Pushing changes to "origin/master".
fatal: unable to access 'https://anongit.kde.org/dolphin.git/': The requested URL returned error: 403
Usage Exception: Push failed! Fix the error and run "arc land" again.

If you don't have a contributor account, someone who does needs to land the patch on your behalf. Don't worry, it'll happen! I just want Dolphin's maintainer @elvisangelaccio to give his OK first.

If you don't have a contributor account, someone who does needs to land the patch on your behalf. Don't worry, it'll happen! I just want Dolphin's maintainer @elvisangelaccio to give his OK first.

Thanks for response! Now it is time to fix something else.

Yay! \o/

Looks like you've caught the KDE bug. :)

Thanks for the patch!

Given that the correctness of this code depends on the locale, I'm not confident we won't break some corner cases.
Ideally we'd need more unit tests in KFileItemModelTest::testNameRoleGroups(), but I understand that's a lot to ask.

@cfeck in the bug report suggested to add more letters ranges. @AndreyYashkin Did you try that?

Thanks for the patch!

Given that the correctness of this code depends on the locale, I'm not confident we won't break some corner cases.
Ideally we'd need more unit tests in KFileItemModelTest::testNameRoleGroups(), but I understand that's a lot to ask.

@cfeck in the bug report suggested to add more letters ranges. @AndreyYashkin Did you try that?

As I understand the puprose of vector "lettersAtoZ" is to put words like "Ottava" and german "Österreich" in one latin "O" group. It seems to be some feature of german language that was taken into account by the author of code. I cannot speak for all who use cyrillic script, but in russian we do not expect anything like this. Thatwhy I discarded the idea of one more range. In addition, it may be wrong to apply such behavior for all languages that use extended latin alphabets.

QCollator thinks that cyrillic symbols are less than latin characters, while chinese and others are bigger. For this reason, it is necessary to put them in a new group without searching them in the vector or they will be putted in latin "A" group.

AndreyYashkin added a comment.EditedJul 8 2019, 11:29 AM

By the way. Even now it is broken for corner cases and I can't see any simple solution for this.

cfeck added a comment.EditedJul 10 2019, 7:52 PM

newGroupValue = newFirstChar appears twice. Could it be moved before the if as the default value to reduce the else nesting?

I've deleted the else repeating and added warning about founded wrong behavior for Z group corner case.


Solving this requires another method for separation Latin and non Latin characters.

Fixed brackets and spaces style

elvisangelaccio accepted this revision.Jul 17 2019, 7:32 PM

Thanks for the patch!

Given that the correctness of this code depends on the locale, I'm not confident we won't break some corner cases.
Ideally we'd need more unit tests in KFileItemModelTest::testNameRoleGroups(), but I understand that's a lot to ask.

@cfeck in the bug report suggested to add more letters ranges. @AndreyYashkin Did you try that?

As I understand the puprose of vector "lettersAtoZ" is to put words like "Ottava" and german "Österreich" in one latin "O" group. It seems to be some feature of german language that was taken into account by the author of code. I cannot speak for all who use cyrillic script, but in russian we do not expect anything like this. Thatwhy I discarded the idea of one more range. In addition, it may be wrong to apply such behavior for all languages that use extended latin alphabets.

QCollator thinks that cyrillic symbols are less than latin characters, while chinese and others are bigger. For this reason, it is necessary to put them in a new group without searching them in the vector or they will be putted in latin "A" group.

I see. I'd say let's push this patch and let's wait for feedback from 19.08 beta users.

You don't have commit access, do you? What's the email address we should use for the git autorship?

I found it in the bug report: andreyyashkin@gmail.com

hein added a subscriber: hein.EditedJul 18 2019, 4:06 AM

FYI: I gave this patch a spin with Korean and the grouping is not quite working the right way.

In the Korean alphabet, multiple letters combine into morpho-syllabic blocks. For example, there's the consonant 'ㅋ' (k) and the vowels 'ㅏ' (a) and 'ㅗ' (o). When used together in in a syllable, they're written as '카' and '코' (ka and ko). You'll notice '카' and '코' are single characters (comprised of two letters each). Unicode contains code points for both the individual letters and the pre-composed syllable forms. So those syllables can be encoded in two ways, as pairs of code points (i.e. combining characters) or single code points. This is similar to the way German umlauts work in Unicode, where 'ä' can be encoded either as 'a' + the combining character for diactrics, or as the pre-composed code point. In Unicode these different variants are called normalization forms and there's standardized algorithms for converting between them.

The grouping code in Dolphin currently naively operates on the character level without applying normalization first. Which means that words starting with '카' and '코' are put into two seperate groups, even though their starting letter is actually the same ('ㅋ'). They should be in the same group and collated properly within, as a Korean dictionary would do.

It may not seem that way, but that's actually the same problem that this code is trying to solve by special-casing Latin: Normalizing could also be used to drop diacritics and accents from the letters to group them together.

I'd say it's up to the Dolphin maintainership to see if they want this patch as stop-gap or now, but eventually this code will need to be rewritten properly so it does the right thing generically for all scripts.

It's really a Unicode support problem, not a Latin or Cyrillic problem. A good rule of thumb: If you're writing special sauce code that treats a particular writing system individually, you're likely (though not always, because the world is complicated :) doing it wrong, and a title like "fix xyz for <specific writing system>" is a red flag.

AndreyYashkin updated this revision to Diff 61963.EditedJul 18 2019, 10:05 AM

I used @hein idea to apply normalization which is implemented in unicode. Now it seems to be the final solution without working with particular writing system individually.

AndreyYashkin added a comment.EditedJul 18 2019, 10:54 AM

I am sorry. I was so happy with this simple solution that forgot about one impornat aspect which is different from from language to language. If in a German dictionary 'O' and 'Ö' would be in the same group, in Russian 'Е' and 'Ё' or 'И' and 'Й' can never be together like new patch does. Probably, it is applicable to Korean, but in doesen't solve problem for all languages.

Previous Diff 61904 was correct.

Previous Diff 61904 was correct.

Did that work also with Korean?

Previous Diff 61904 was correct.

Did that work also with Korean?

It solves only the problem with cyrillic names and doesn't change behavior for other non latin characters. Unfortunately, I don't have enough qualification improve a situation with others writing systems.

elvisangelaccio accepted this revision.Jul 28 2019, 3:12 PM

Well, I like that we are replacing 20+ lines of code with only one ;)

I'll push on master only so that we can let the users test this change.

This revision was automatically updated to reflect the committed changes.