[WIP] Handle CJK characters
Needs ReviewPublic

Authored by michaelh on Mar 21 2018, 1:38 PM.

Details

Reviewers
hein
cfeck
Summary

This patch is a request for comments.

  1. Is this approach viable?
  2. Who can help to extend , refine and test this?

Depends on D11587

BUG:362647

Test Plan

make test

Diff Detail

Repository
R293 Baloo
Branch
cjk (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
michaelh created this revision.Mar 21 2018, 1:38 PM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptMar 21 2018, 1:38 PM
michaelh requested review of this revision.Mar 21 2018, 1:38 PM
cfeck added a subscriber: cfeck.EditedMar 21 2018, 2:22 PM

I have zero knowledge about Baloo, but I can add some comments regarding Unicode.

  • the four ranges you used are all adjacent, so you could contract to {0x4E00, 0x9FFF}
  • there are more ranges for CJK characters in the BMP, at least {0x3400, 0x4DBF} would be useful (I don't know if CJK users ever use the compatibility characters)
  • to be able to fully support the remaining CJK blocks in higher planes, the code would need to handle surrogate pairs
  • if Baloo doesn't handle CJK, it maybe also doesn't handle other non-Latin scripts, so I suggest to use QChar::category()
alexeymin added a subscriber: alexeymin.EditedMar 21 2018, 2:47 PM

Regarding this - I don't know if it is really chinese look foreign enough to me anyway.
Some lines of text in your test script surely look like Japanese Hiragana to me, especially this one (and tests related to this)

echo "otto东到宛平路anna"> "終末なにしてますか?忙しいですか?救ってもらっていいですか? EP01 太阳の倾いたこの世界で -broken chronograph-.txt"

Google translate confirms :)
But do your ranges include that characters? This answer on stackoverflow says that there are also other ranges for Hiragana, Katakana, etc... as @cfeck already said. Does it pass the test for you?

link to tables from answer

Regarding this - I don't know if it is really chinese look foreign enough to me anyway.
Some lines of text in your test script surely look like Japanese Hiragana to me, especially this one (and tests related to this)

echo "otto东到宛平路anna"> "終末なにしてますか?忙しいですか?救ってもらっていいですか? EP01 太阳の倾いたこの世界で -broken chronograph-.txt"

That's the only thing I was sure of (It was in fact an mkv I just watched). At this stage the actual language does not really matter.

But do your ranges include that characters? This answer on stackoverflow says that there are also other ranges for Hiragana, Katakana, etc... as @cfeck already said.

My rationale was not to throw in every range mentioned on that wikipedia page, but just enough to make this work and illustrate the general approach.

Does it pass the test for you?

All except the last two that is '*ですか? EP01' (<mixture of Latin/Hiragana) and 'ですか' (<pure Hiragana). I could lie now and say I left out Hiragana character on purpose. I didn't, but for Hiragana the one grapheme = one search term does not apply. So those tests in fact should fail.

@cfeck

if Baloo doesn't handle CJK, it maybe also doesn't handle other non-Latin scripts, so I suggest to use QChar::category()

I wasn't aware of QChar::category() . Thank you.

Restricted Application added a subscriber: Frameworks. · View Herald TranscriptMar 21 2018, 6:04 PM
hein added a comment.Mar 22 2018, 7:30 AM

If you're going to loop over a QString and break it down into QChars anyway, why don't you just use QChar::script?

hein added a comment.Mar 22 2018, 7:37 AM

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

michaelh updated this revision to Diff 30252.Mar 22 2018, 8:13 PM
michaelh edited the test plan for this revision. (Show Details)
  • Base on D11587
  • Make tests pass
  • Use QChar.script() and QChar.isLetter()
michaelh edited the summary of this revision. (Show Details)Mar 22 2018, 8:16 PM
michaelh edited the test plan for this revision. (Show Details)
cfeck added inline comments.Mar 22 2018, 8:58 PM
src/engine/characterrangescjk.cpp
36

Add surrogate pair handling. Basic outline:

uint c = text.at(i);
if (QChar::isSurrogate(c)) {
    c = QChar::surrogateToUcs4(c, text.at(++i));
}
if (QChar::isLetter(c) ...

You would need to add 'i' bounds checking, and verifying that you are indeed seeing a valid pair.

michaelh updated this revision to Diff 30257.Mar 22 2018, 9:28 PM
  • Add surrogate pair handling
  • Remove obsolete #include
michaelh added inline comments.Mar 22 2018, 9:31 PM
src/engine/characterrangescjk.cpp
41

Like that? Only 60% aware of what I'm doing here.
Do you know of an example text I can incorporate into the test?
All I've got so far is tables, numbers and such, no text.

bruns added a comment.Mar 22 2018, 9:31 PM
In D11552#231330, @hein wrote:

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

There are a few implications here:

  • splitting to much generates to unspecific terms, especially in case of full text indexing (Think of splitting a western language at character level, most texts likely contain almost the full alphabet. Same likely applies to Katakana with its about ~100 graphemes)
  • term generation at query and index time have to agree about what a term is, otherwise a search will likely return nothing. Changing the splitting at a later time will require reindexing all affected files
  • better splitting will cost some more time at index generation, but likely makes searching faster (additional time for term generation will be neglegible, but the search terms are less complex - e.g. "abc" instead of "a" AND "b" AND "c").
In D11552#231330, @hein wrote:

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

There are a few implications here:

  • splitting to much generates to unspecific terms, especially in case of full text indexing (Think of splitting a western language at character level, most texts likely contain almost the full alphabet. Same likely applies to Katakana with its about ~100 graphemes)
  • term generation at query and index time have to agree about what a term is, otherwise a search will likely return nothing. Changing the splitting at a later time will require reindexing all affected files
  • better splitting will cost some more time at index generation, but likely makes searching faster (additional time for term generation will be neglegible, but the search terms are less complex - e.g. "abc" instead of "a" AND "b" AND "c").

Currently termgenerator uses QTextBoundaryFinder bf(QTextBoundaryFinder::Word, text);

cfeck requested changes to this revision.Mar 22 2018, 10:31 PM
cfeck added inline comments.
src/engine/characterrangescjk.cpp
39

You need to use uint to store the full character. QChar is *not* a character, it is just one UTF-16 codeword.

Additionally, use the QChar::name(uint) static methods to operate on uint characters.

This revision now requires changes to proceed.Mar 22 2018, 10:31 PM
cfeck added inline comments.Mar 22 2018, 10:37 PM
src/engine/characterrangescjk.cpp
42

To add a uint to a QStringList, convert the uint character to a QString. Either manually compose the surrogates (faster, but uglier code), or use QString::fromUcs4() (slower, but nicer to read).

michaelh updated this revision to Diff 30337.Mar 23 2018, 5:16 PM
  • Correct surrogate pair handling
cfeck accepted this revision.Mar 23 2018, 8:46 PM

The Unicode handling looks correct.

This revision is now accepted and ready to land.Mar 23 2018, 8:46 PM
cfeck resigned from this revision.Mar 23 2018, 8:46 PM
This revision now requires review to proceed.Mar 23 2018, 8:46 PM
cfeck added inline comments.Mar 23 2018, 8:54 PM
autotests/unit/engine/termgeneratortestutf.cpp
335

Looking at http://www.unicode.org/roadmaps/sip/ I would suggest to use U+2A6FF instead (U+2CEB0 is used in newer Unicode versions).

@cfeck: Thanks a lot for your help.

michaelh updated this revision to Diff 30410.Mar 24 2018, 4:41 PM
  • Retain term positions
  • Optimize
  • Apply suggested change
michaelh marked 5 inline comments as done.Mar 24 2018, 4:47 PM
Restricted Application edited subscribers, added: Baloo, kde-frameworks-devel; removed: Frameworks. · View Herald TranscriptJun 12 2019, 2:33 AM