Details

Reviewers

hein
cfeck

Summary

This patch is a request for comments.

Is this approach viable?
Who can help extend, refine, and test this?

Depends on D11587

BUG:362647

Test Plan

make test

Diff Detail

Repository

R293 Baloo

Branch

cjk (branched from master)

Lint

No Linters Available

Unit

No Unit Test Coverage

michaelh created this revision. Mar 21 2018, 1:38 PM

Restricted Application added projects: Frameworks, Baloo. Mar 21 2018, 1:38 PM

michaelh requested review of this revision. Mar 21 2018, 1:38 PM

I have zero knowledge about Baloo, but I can add some comments regarding Unicode.

the four ranges you used are all adjacent, so you could contract them to {0x4E00, 0x9FFF}
there are more ranges for CJK characters in the BMP; at least {0x3400, 0x4DBF} would be useful (I don't know if CJK users ever use the compatibility characters)
to fully support the remaining CJK blocks in higher planes, the code would need to handle surrogate pairs
if Baloo doesn't handle CJK, it may not handle other non-Latin scripts either, so I suggest using QChar::category()
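
As a rough illustration of the first two points, the contracted range and the additional BMP block could be checked as below. This is only a sketch: the helper name isBmpCjkIdeograph is hypothetical and not taken from the patch, and it works on a plain char16_t rather than a QChar so that it has no Qt dependency.

```cpp
// Hypothetical helper, not taken from the patch: true if the UTF-16 code unit
// lies in one of the two BMP ideograph ranges named in the comment above.
constexpr bool isBmpCjkIdeograph(char16_t c)
{
    // {0x3400, 0x4DBF}: CJK Unified Ideographs Extension A
    // {0x4E00, 0x9FFF}: CJK Unified Ideographs (the contracted range)
    return (c >= 0x3400 && c <= 0x4DBF) || (c >= 0x4E00 && c <= 0x9FFF);
}
```

Characters outside the BMP never match here, because a single UTF-16 code unit cannot represent them; that is the surrogate-pair point in the list.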

Regarding this: "I don't know if it is really Chinese; it looks foreign enough to me anyway."
Some lines of text in your test script look like Japanese Hiragana to me, especially this one (and the tests related to it):

echo "otto东到宛平路anna"> "終末なにしてますか？忙しいですか？救ってもらっていいですか？ EP01 太阳の倾いたこの世界で -broken chronograph-.txt"

Google translate confirms :)
But do your ranges include those characters? This answer on Stack Overflow says that there are also other ranges for Hiragana, Katakana, etc., as @cfeck already said. Does it pass the test for you?

link to tables from answer

In D11552#230870, @alexeymin wrote:
Regarding this: "I don't know if it is really Chinese; it looks foreign enough to me anyway."
Some lines of text in your test script look like Japanese Hiragana to me, especially this one (and the tests related to it):
echo "otto东到宛平路anna"> "終末なにしてますか？忙しいですか？救ってもらっていいですか？ EP01 太阳の倾いたこの世界で -broken chronograph-.txt"

That's the only thing I was sure of (It was in fact an mkv I just watched). At this stage the actual language does not really matter.

But do your ranges include those characters? This answer on Stack Overflow says that there are also other ranges for Hiragana, Katakana, etc., as @cfeck already said.

My rationale was not to throw in every range mentioned on that wikipedia page, but just enough to make this work and illustrate the general approach.

Does it pass the test for you?

All except the last two, that is, '*ですか？ EP01' (a mixture of Latin and Hiragana) and 'ですか' (pure Hiragana). I could lie now and say I left out the Hiragana characters on purpose. I didn't, but for Hiragana the rule "one grapheme = one search term" does not apply, so those tests should in fact fail.

@cfeck

if Baloo doesn't handle CJK, it may not handle other non-Latin scripts either, so I suggest using QChar::category()

I wasn't aware of QChar::category() . Thank you.

michaelh removed reviewers: Baloo, Frameworks, lbeltrame, bruns. Mar 21 2018, 6:04 PM

michaelh added subscribers: lbeltrame, bruns.

Restricted Application added a subscriber: Frameworks. Mar 21 2018, 6:04 PM

broulik added a reviewer: hein. Mar 21 2018, 9:48 PM

If you're going to loop over a QString and break it down into QChars anyway, why don't you just use QChar::script?

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

Base on D11587
Make tests pass
Use QChar.script() and QChar.isLetter()

michaelh edited the summary of this revision. Mar 22 2018, 8:16 PM

michaelh edited the test plan for this revision.

michaelh added a dependency: D11587: [WIP] autotests: Introduce TermGeneratorTestUTF.

cfeck added inline comments. Mar 22 2018, 8:58 PM

src/engine/characterrangescjk.cpp
36	Add surrogate pair handling. Basic outline: uint c = text.at(i).unicode(); if (QChar::isSurrogate(c)) { c = QChar::surrogateToUcs4(c, text.at(++i).unicode()); } if (QChar::isLetter(c)) ... You would need to add bounds checking for 'i' and verify that you are indeed seeing a valid pair.
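
The outline above, with the bounds check and pair validation filled in, might look like the following sketch. It deliberately uses std::u16string instead of QString so it compiles without Qt. The function name toCodePoints and the policy of silently skipping unpaired surrogates are assumptions, not something the review specifies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into full code points.
// Assumption: unpaired surrogates are skipped rather than reported.
std::vector<char32_t> toCodePoints(const std::u16string &text)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        const bool isLow = c >= 0xDC00 && c <= 0xDFFF;
        if (isHigh) {
            // Bounds check first, then verify that a low surrogate follows.
            if (i + 1 < text.size() && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
                const char32_t hi = c - 0xD800;
                const char32_t lo = text[i + 1] - 0xDC00;
                out.push_back(0x10000 + (hi << 10) + lo);
                ++i; // consume the low surrogate as well
            }
            continue; // a lone high surrogate is dropped
        }
        if (isLow) {
            continue; // stray low surrogate without a preceding high one
        }
        out.push_back(c);
    }
    return out;
}
```

With Qt, the range tests would presumably be replaced by the static QChar helpers mentioned in the comment.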

Add surrogate pair handling
Remove obsolete #include

michaelh added inline comments. Mar 22 2018, 9:31 PM

src/engine/characterrangescjk.cpp
41	Like that? Only 60% aware of what I'm doing here. Do you know of an example text I can incorporate into the test? All I've got so far is tables, numbers and such, no text.

In D11552#231330, @hein wrote:

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

There are a few implications here:

splitting too much generates terms that are too unspecific, especially for full-text indexing (think of splitting a Western language at character level: most texts likely contain almost the full alphabet; the same likely applies to Katakana with its roughly 100 graphemes)
term generation at query time and at index time has to agree on what a term is, otherwise a search will likely return nothing; changing the splitting later will require reindexing all affected files
better splitting will cost some more time at index generation, but likely makes searching faster (the additional time for term generation will be negligible, but the search terms are less complex, e.g. "abc" instead of "a" AND "b" AND "c")
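
To make the trade-off concrete, here is a minimal sketch of the per-character approach under discussion: each Han ideograph becomes its own term, while any other run of non-space characters stays together. The names isHan and splitTerms are hypothetical, the ranges are only the ones mentioned in this review plus Extension B, and the input is assumed to be already decoded to code points. Using the same function at index time and at query time is one way to satisfy the second point.

```cpp
#include <string>
#include <vector>

// Hypothetical predicate: Han ideographs in the BMP plus Extension B.
bool isHan(char32_t c)
{
    return (c >= 0x3400 && c <= 0x4DBF) || (c >= 0x4E00 && c <= 0x9FFF)
        || (c >= 0x20000 && c <= 0x2A6DF);
}

// Hypothetical tokenizer: one term per Han ideograph; other runs of
// non-space characters are kept whole. Only U+0020 counts as a separator.
std::vector<std::u32string> splitTerms(const std::u32string &text)
{
    std::vector<std::u32string> terms;
    std::u32string run;
    auto flush = [&] {
        if (!run.empty()) {
            terms.push_back(run);
            run.clear();
        }
    };
    for (char32_t c : text) {
        if (isHan(c)) {
            flush();                               // end any pending Latin run
            terms.push_back(std::u32string(1, c)); // ideograph is its own term
        } else if (c == U' ') {
            flush();
        } else {
            run += c;
        }
    }
    flush();
    return terms;
}
```

Hiragana and Katakana fall into the "other run" branch here, so they are not split per character.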

In D11552#231784, @bruns wrote:

In D11552#231330, @hein wrote:

For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.

There are a few implications here:

splitting too much generates terms that are too unspecific, especially for full-text indexing (think of splitting a Western language at character level: most texts likely contain almost the full alphabet; the same likely applies to Katakana with its roughly 100 graphemes)

term generation at query time and at index time has to agree on what a term is, otherwise a search will likely return nothing; changing the splitting later will require reindexing all affected files

better splitting will cost some more time at index generation, but likely makes searching faster (the additional time for term generation will be negligible, but the search terms are less complex, e.g. "abc" instead of "a" AND "b" AND "c")

Currently termgenerator uses QTextBoundaryFinder bf(QTextBoundaryFinder::Word, text);

cfeck requested changes to this revision. Mar 22 2018, 10:31 PM

cfeck added inline comments.

src/engine/characterrangescjk.cpp
39	You need to use uint to store the full character. QChar is not a character; it is just one UTF-16 code unit. Additionally, use the static QChar methods that take a uint, such as QChar::isLetter(uint), to operate on uint characters.

This revision now requires changes to proceed. Mar 22 2018, 10:31 PM

cfeck added inline comments. Mar 22 2018, 10:37 PM

src/engine/characterrangescjk.cpp
42	To add a uint to a QStringList, convert the uint character to a QString. Either manually compose the surrogates (faster, but uglier code), or use QString::fromUcs4() (slower, but nicer to read).
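
The "manually compose the surrogates" option could be sketched as follows. The sketch targets std::u16string rather than QString to stay free of Qt, and the name fromCodePoint is hypothetical. It assumes the input is a valid Unicode scalar value (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string fromCodePoint(char32_t cp)
{
    if (cp < 0x10000) {
        // BMP character: a single code unit is enough.
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    const char32_t v = cp - 0x10000;
    std::u16string s;
    s += static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
    s += static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
    return s;
}
```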

Correct surrogate pair handling

The Unicode handling looks correct.

This revision is now accepted and ready to land. Mar 23 2018, 8:46 PM

cfeck resigned from this revision. Mar 23 2018, 8:46 PM

This revision now requires review to proceed. Mar 23 2018, 8:46 PM

cfeck added inline comments. Mar 23 2018, 8:54 PM

autotests/unit/engine/termgeneratortestutf.cpp
335	Looking at http://www.unicode.org/roadmaps/sip/, I would suggest using U+2A6FF instead (U+2CEB0 is used in newer Unicode versions).

@cfeck: Thanks a lot for your help.

Retain term positions
Optimize
Apply suggested change

michaelh marked 5 inline comments as done. Mar 24 2018, 4:47 PM

rasphino added a subscriber: rasphino. Jun 12 2019, 2:33 AM

Restricted Application edited subscribers, added: Baloo, kde-frameworks-devel; removed: Frameworks. Jun 12 2019, 2:33 AM

fancyzhang added a subscriber: fancyzhang. Aug 26 2019, 1:21 PM

		Path
M		autotests/unit/engine/termgeneratortestutf.cpp (51 lines)
M		src/engine/CMakeLists.txt (1 line)
A	M	src/engine/characterrangescjk.h (53 lines)
A	M	src/engine/characterrangescjk.cpp (62 lines)
M		src/engine/queryparser.cpp (6 lines)
M		src/engine/termgenerator.h (1 line)
M		src/engine/termgenerator.cpp (64 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	30125	b050c0a		Mar 21 2018, 1:30 PM	★	★
Diff 2	30252	99f36b1	- Base on D11587	Mar 22 2018, 8:12 PM	★	★
Diff 3	30257	f75f55b	- Add surrogate pair handling	Mar 22 2018, 9:28 PM	★	★
Diff 4	30337	c6ae1c9	- Correct surrogate pair handling	Mar 23 2018, 5:16 PM	★	★
Diff 5	30410	fabbb77	- Retain term positions	Mar 24 2018, 4:41 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
f34554d31b5b	56e628958f1c	23ef8f9078fa	Michael Heidelbach	Apply suggested change	Mar 24 2018, 4:41 PM
23ef8f9078fa	eb05a70d774e	8f93740e558c	Michael Heidelbach	Optimize	Mar 24 2018, 4:24 PM
8f93740e558c	1bc4cec5cb0c	88dc0911e428	Michael Heidelbach	Adjust terms order and positions	Mar 24 2018, 11:54 AM
88dc0911e428	788d4847ddef	0d7c2fff7a9e	Michael Heidelbach	Correct surrogate pair handling	Mar 23 2018, 5:17 PM
0d7c2fff7a9e	db05dd1c8050	251b4f447870	Michael Heidelbach	Remove obsolete #include	Mar 22 2018, 9:29 PM
251b4f447870	2597bab1624b	c57e408061cf	Michael Heidelbach	Add surrogate pair handling	Mar 22 2018, 9:27 PM
c57e408061cf	33b38781ea93	58a27c50e82a	Michael Heidelbach	Test for QChar.isLetter()	Mar 22 2018, 8:11 PM
58a27c50e82a	d798e46b1ec5	a64afa47e6c6	Michael Heidelbach	Use QChar.script()	Mar 22 2018, 7:28 PM
a64afa47e6c6	98a73ad8c5bd	7e6ba489e747	Michael Heidelbach	Make tests pass	Mar 22 2018, 7:06 PM
7e6ba489e747	3ecfa4a833d5	9607e80ec181	Michael Heidelbach	Correct merge errors	Mar 22 2018, 6:54 PM
9607e80ec181	c4fd270697cf	fabbb77c00ad	Michael Heidelbach	termgeneratortest: Add more scripts	Mar 22 2018, 11:45 AM

	Status	Author	Revision
	Needs Review	michaelh	D11552 [WIP] Handle CJK characters
	Needs Review	michaelh	D11587 [WIP] autotests: Introduce TermGeneratorTestUTF

[WIP] Handle CJK characters
Needs Review, Public