Fix Sonnet autodetect failing on Indian langs
ClosedPublic

Authored by waqar on Nov 23 2019, 6:22 PM.

Details

Summary

Fixes Bug 176537. Sonnet fails to autodetect a language if there are no trigrams available for the language. In some cases, the script of the language is detected exactly but only because no trigrams were found for the language in question, autodetection fails.

Additionally, I have tested and patched it in QOwnNotes and it is working pretty well there.

Diff Detail

Repository
R246 Sonnet
Branch
fix-176537 (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19139
Build 19157: arc lint + arc unit
waqar created this revision.Nov 23 2019, 6:22 PM
Restricted Application added a project: Frameworks. · View Herald TranscriptNov 23 2019, 6:22 PM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
waqar requested review of this revision.Nov 23 2019, 6:22 PM
waqar edited the summary of this revision. (Show Details)Nov 23 2019, 6:25 PM
waqar updated this revision to Diff 70238.Nov 23 2019, 6:48 PM
waqar edited the summary of this revision. (Show Details)

Fix Sonnet autodetect completely failing when no trigrams found

waqar edited the summary of this revision. (Show Details)Nov 24 2019, 7:47 AM
waqar updated this revision to Diff 70262.Nov 24 2019, 3:11 PM

Revert a change, as autodetection fails for languages with similar scripts

waqar updated this revision to Diff 70288.Nov 25 2019, 1:47 PM

Add test for autodetection

waqar updated this revision to Diff 70289.Nov 25 2019, 1:50 PM

Add Test for autodetection

waqar updated this revision to Diff 70290.Nov 25 2019, 1:54 PM

Update copyright

waqar updated this revision to Diff 70291.Nov 25 2019, 1:58 PM

Remove unused includes

ognarb added a subscriber: ognarb.

Adding a potential reviewer :)

You are right that in GuessLanguage::identify(const QString &text, const QStringList &suggestionsListIn), if a language without trigrams is not present in 'suggestionsListIn', it will never be 'guessed'.

I'd suggest to move your changes to GuessLanguage::identify(const QString &text, const QStringList &suggestionsListIn) after the call to d->identify(text, d->findRuns(text)); but only add those languages for which there is a dictionary AND which don't have trigrams.

There is also a bug in GuessLanguagePrivate::guessFromTrigrams(const QString &sample, const QStringList &languages): if m_minConfidence is left to its default value of '0', that function will always return an empty list. I will propose a fix shortly.

The real issue behind Bug 176537 is a different one, however. On-the-fly spell checking in Kate(Part) will only check one line at a time, potentially not providing enough text for a meaningful language detection. I plan to perform the language detection inside KatePart, so that there is also feedback regading the detected language that is shown to the user, who can then also override the detected language, if desired.

waqar added a comment.EditedJan 1 2020, 3:58 PM

Hi,
First of all thanks for reviewing.

I'd suggest to move your changes to GuessLanguage::identify(const QString &text, const QStringList &suggestionsListIn) after the call to d->identify(text, d->findRuns(text));

Okay. I will do that, but I will have to move the d->findRuns(text) out of the function call.

but only add those languages for which there is a dictionary

I think that will not be an issue because s_scriptLanguages only has the languages for which there are dictionaries. So just to make my point clear, for example if you don't have 'English' dictionary installed, sonnet will never be able to guess the language of the text as 'English'.

The resulting changes look like this:

 //get the scripts for current text
 auto scriptsList = d->findRuns(text);

//try guessing from trigrams
 QStringList candidateLanguages = d->identify(text, scriptsList);

 if (candidateLanguages.isEmpty() && !scriptsList.isEmpty()) {
     for (const QChar::Script script : scriptsList) {
         const auto languagesList = d->s_scriptLanguages.values(script);

         for (const auto &lang : languagesList) {
          //if trigrams don't have this language then add it to the candidates
             if (!d->s_knownModels.contains(lang))
                 candidateLanguages.append(lang);
         }
     }
 }

There is also a bug in GuessLanguagePrivate::guessFromTrigrams(const QString &sample, const QStringList &languages): if m_minConfidence is left to its default value of '0', that function will always return an empty list. I will propose a fix shortly.

Alright, I am excited to hear.

The real issue behind Bug 176537 is a different one, however. On-the-fly spell checking in Kate(Part) will only check one line at a time, potentially not providing enough text for a meaningful language detection.

To be honest, I haven't ever had an issue with that. I mostly test on QOwnNotes, and spellchecking works the same way there i.e., one line at a time. If there is a dictionary present, sonnet will guess the language correctly most of the times. But you are right in that,..more text would enable sonnet to be more accurate. However, autodetection works on a sentence basis, and sentences can sometimes be quite short.

I plan to perform the language detection inside KatePart, so that there is also feedback regading the detected language that is shown to the user, who can then also override the detected language, if desired.

That would be really cool!
I guess the rest of the dictionaries (of the same script) can be shown in the context menu to allow the user to override the detected language.

waqar updated this revision to Diff 72545.Jan 1 2020, 4:26 PM

Apply requested changes

Thanks for the updated patch!

There is also a bug in GuessLanguagePrivate::guessFromTrigrams(const QString &sample, const QStringList &languages): if m_minConfidence is left to its default value of '0', that function will always return an empty list. I will propose a fix shortly.

Alright, I am excited to hear.

Here is my proposal:

https://phabricator.kde.org/D26346

The real issue behind Bug 176537 is a different one, however. On-the-fly spell checking in Kate(Part) will only check one line at a time, potentially not providing enough text for a meaningful language detection.

To be honest, I haven't ever had an issue with that. I mostly test on QOwnNotes, and spellchecking works the same way there i.e., one line at a time. If there is a dictionary present, sonnet will guess the language correctly most of the times. But you are right in that,..more text would enable sonnet to be more accurate. However, autodetection works on a sentence basis, and sentences can sometimes be quite short.

With your patch, language detection works much better in KatePart. However, due to way Sonnet is used in KatePart for on-the-fly spell checking, one issue still is that every word is basically checked against every dictionary. So, in the sentence "English is an interesting langage", "langage" may not be considered to be misspelled since "langage" is a correctly spelled French word, for example. I will still work on that.

src/core/guesslanguage.cpp
589

Couldn't this if-statement can be dropped? I guess one can argue that sometimes there may be a language without trigrams that would even be a better language guess?

waqar updated this revision to Diff 72684.Jan 3 2020, 1:22 PM

Apply requested changes

waqar added inline comments.Jan 3 2020, 4:27 PM
src/core/guesslanguage.cpp
589

Yeah, I think so too

waqar marked an inline comment as done.Jan 3 2020, 4:28 PM
waqar marked an inline comment as done.
apol added a subscriber: apol.Feb 9 2020, 11:34 PM

This patch makes it do all the languages identified by identify + all the languages supported by the script. This could mean lots of languages for some scripts...

src/core/guesslanguage.cpp
586

Shouldn't identify be taking care of the scripts already?

waqar added inline comments.Feb 10 2020, 5:23 AM
src/core/guesslanguage.cpp
586

identify() fails if a certain language is not present in the trigrams.

waqar added a comment.Feb 10 2020, 5:36 AM
In D25495#608582, @apol wrote:

This patch makes it do all the languages identified by identify + all the languages supported by the script. This could mean lots of languages for some scripts...

s_scriptLanguages only contains languages for which there are 'dictionaries' installed/present on the system. So lots of languages for a script could only happen if someone has installed a lot of dictionaries related to that script and all those languages are not part of the trigrams.

This patch only targets the languages which are not present in the trigrams.

apol added inline comments.Mar 24 2020, 2:48 PM
src/core/guesslanguage.cpp
586

Why did you handle it here rather than in identify then? is it a problem doing it there? I see that identify is being used elsewhere too. It will be wrong there.

waqar added inline comments.Mar 24 2020, 2:52 PM
src/core/guesslanguage.cpp
586

Initially I handled it in identify () but then @mludwig suggested that I do it here instead

apol accepted this revision.Mar 24 2020, 2:53 PM

Whatever, let's go with this and we can learn along the way.

Sorry it took so long.

This revision is now accepted and ready to land.Mar 24 2020, 2:53 PM
waqar added a comment.Mar 24 2020, 2:55 PM
In D25495#633551, @apol wrote:

Whatever, let's go with this and we can learn along the way.

Sorry it took so long.

I am excited.
Thanks!

I will be sure to keep checking in if anything goes amiss in Sonnet and will try to fix it.

apol added a comment.Mar 24 2020, 3:03 PM

Can you land the patch or you need us to?

waqar added a comment.Mar 24 2020, 3:17 PM

I tried but I don't have the access.

So I guess you guys would have land.

This revision was automatically updated to reflect the committed changes.
dfaure added inline comments.Mar 24 2020, 9:26 PM
autotests/test_autodetect.cpp
40

Is there an explicit way to check whether a dictionary is installed?
The test row should be skipped, if not.

autotests/test_autodetect.cpp
40

On it.

I will fix and send the patch

waqar added a comment.Mar 29 2020, 1:52 PM

Please have a look and submit fixes.

Patch sent at: D28406