This patch is a request for comments.
- Is this approach viable?
- Who can help to extend, refine and test this?
Depends on D11587
BUG:362647
I have zero knowledge about Baloo, but I can add some comments regarding Unicode.
Regarding this: "I don't know if it is really Chinese, looks foreign enough to me anyway."
Some lines of text in your test script surely look like Japanese Hiragana to me, especially this one (and the tests related to it):
echo "otto东到宛平路anna"> "終末なにしてますか?忙しいですか?救ってもらっていいですか? EP01 太阳の倾いたこの世界で -broken chronograph-.txt"
Google Translate confirms :)
But do your ranges include those characters? This answer on Stack Overflow says that there are also other ranges for Hiragana, Katakana, etc., as @cfeck already said. Does it pass the test for you?
That's the only thing I was sure of (it was in fact an mkv I had just watched). At this stage the actual language does not really matter.
But do your ranges include those characters? This answer on Stack Overflow says that there are also other ranges for Hiragana, Katakana, etc., as @cfeck already said.
My rationale was not to throw in every range mentioned on that Wikipedia page, but just enough to make this work and illustrate the general approach.
Does it pass the test for you?
All except the last two, that is '*ですか? EP01' (a mixture of Latin and Hiragana) and 'ですか' (pure Hiragana). I could lie now and say I left out the Hiragana characters on purpose. I didn't, but for Hiragana the rule "one grapheme = one search term" does not apply, so those tests should in fact fail.
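For illustration, the "one grapheme = one search term" rule for ideographs could be sketched roughly like this. It is plain standard C++ rather than Qt so that it compiles on its own; the names isHanIdeograph and splitIdeographTerms are hypothetical, and the three ranges are only an assumed, incomplete subset, not the list used in the patch. Hiragana is deliberately not in the list, so a run like 'ですか' stays together as one term.

```cpp
#include <string>
#include <vector>

// Assumed subset of ideograph ranges, for illustration only.
inline bool isHanIdeograph(char32_t c)
{
    return (c >= 0x4E00 && c <= 0x9FFF)     // CJK Unified Ideographs
        || (c >= 0x3400 && c <= 0x4DBF)     // CJK Unified Ideographs Extension A
        || (c >= 0x20000 && c <= 0x2A6DF);  // CJK Unified Ideographs Extension B
}

// Hypothetical helper: every ideograph becomes its own term, while any other
// consecutive characters are kept together as a single run.
std::vector<std::u32string> splitIdeographTerms(const std::u32string &text)
{
    std::vector<std::u32string> terms;
    std::u32string run;
    for (char32_t c : text) {
        if (isHanIdeograph(c)) {
            if (!run.empty()) {
                terms.push_back(run);   // flush the pending non-ideograph run
                run.clear();
            }
            terms.push_back(std::u32string(1, c));
        } else {
            run.push_back(c);
        }
    }
    if (!run.empty()) {
        terms.push_back(run);
    }
    return terms;
}
```

The sketch works on UTF-32 input to keep surrogate handling out of the picture; a real term generator would also have to deal with whitespace and punctuation, which this one simply leaves inside the runs.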
If Baloo doesn't handle CJK, it probably doesn't handle other non-Latin scripts either, so I suggest using QChar::category().
I wasn't aware of QChar::category(). Thank you.
If you're going to loop over a QString and break it down into QChars anyway, why don't you just use QChar::script?
For the record though - a better way to do this is to use QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. This still isn't super great for Chinese though. If you want to really-properly do it you'll end up depending on ICU and using its BreakIterator combined with dict-based support for Chinese, which isn't terribly fast however.
src/engine/characterrangescjk.cpp | ||
---|---|---|
36 | Add surrogate pair handling. Basic outline: uint c = text.at(i); if (QChar::isSurrogate(c)) { c = QChar::surrogateToUcs4(c, text.at(++i)); } if (QChar::isLetter(c) ... You would need to add bounds checking on 'i' and verify that you are indeed seeing a valid pair. |
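
The bounds check and the pair validation mentioned in the comment above could look roughly like the following. This is a standalone sketch in standard C++, not Baloo code: decodeUtf16 and the three small helpers are hypothetical names, and in Qt code QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::surrogateToUcs4() would take their place. Replacing unpaired surrogates with U+FFFD is an assumption; skipping them would be just as defensible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

inline bool isHighSurrogateUnit(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogateUnit(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low surrogate pair into one code point.
inline char32_t combineSurrogateUnits(char16_t high, char16_t low)
{
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// Decode UTF-16 code units into full code points.
// Unpaired surrogates are replaced by U+FFFD (assumed policy).
std::vector<char32_t> decodeUtf16(const std::u16string &text)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t c = text[i];
        if (isHighSurrogateUnit(text[i])) {
            // Bounds check first, then make sure the next unit really is a low surrogate.
            if (i + 1 < text.size() && isLowSurrogateUnit(text[i + 1])) {
                c = combineSurrogateUnits(text[i], text[i + 1]);
                ++i;  // consume the low surrogate as well
            } else {
                c = 0xFFFD;
            }
        } else if (isLowSurrogateUnit(text[i])) {
            c = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        out.push_back(c);
    }
    return out;
}
```
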
src/engine/characterrangescjk.cpp | ||
---|---|---|
41 | Like that? Only 60% aware of what I'm doing here. |
There are a few implications here:
Currently termgenerator uses QTextBoundaryFinder bf(QTextBoundaryFinder::Word, text);
src/engine/characterrangescjk.cpp | ||
---|---|---|
39 | You need to use uint to store the full character. QChar is *not* a character, it is just one UTF-16 codeword. Additionally, use the QChar::name(uint) static methods to operate on uint characters. |
src/engine/characterrangescjk.cpp | ||
---|---|---|
42 | To add a uint to a QStringList, convert the uint character to a QString. Either manually compose the surrogates (faster, but uglier code), or use QString::fromUcs4() (slower, but nicer to read). |
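
For reference, the "manually compose the surrogates" variant amounts to the arithmetic below. Again this is a standalone standard C++ sketch with a hypothetical function name (encodeUtf16), not the code from the patch; in Qt the same values come from QChar::highSurrogate() and QChar::lowSurrogate(), or the whole conversion from QString::fromUcs4().

```cpp
#include <cassert>
#include <string>

// Encode one code point as one or two UTF-16 code units.
std::u16string encodeUtf16(char32_t cp)
{
    // Precondition: a valid scalar value (in range and not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(char16_t(cp));                     // BMP: a single code unit
    } else {
        cp -= 0x10000;                                   // 20 bits remain
        out.push_back(char16_t(0xD800 + (cp >> 10)));    // high surrogate: top 10 bits
        out.push_back(char16_t(0xDC00 + (cp & 0x3FF)));  // low surrogate: bottom 10 bits
    }
    return out;
}
```
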
autotests/unit/engine/termgeneratortestutf.cpp | ||
---|---|---|
335 | Looking at http://www.unicode.org/roadmaps/sip/ I would suggest using U+2A6FF instead (U+2CEB0 is used in newer Unicode versions). |