[Kuit] Port QRegExp to QRegularExpression, third pass
ClosedPublic

Authored by ahmadsamir on Jan 2 2020, 2:09 PM.

Details

Summary

Port determineIsStructured() to QRegularExpression.

Port parseUiMarker() to QRegularExpression:

  • call toLower() once on the context_ parameter, that should make all the strings lower case in one go
  • remove redundant trimmed() calls, the original code (before this patch) chopped the context_ string right before the first whitespace "\s", i.e. the code expected that there are no whitespaces inside "@role:cue/format"
Test Plan

make && ctest

Diff Detail

Repository
R249 KI18n
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
ahmadsamir created this revision.Jan 2 2020, 2:09 PM
Restricted Application added a project: Frameworks. · View Herald TranscriptJan 2 2020, 2:09 PM
ahmadsamir requested review of this revision.Jan 2 2020, 2:09 PM
dfaure requested changes to this revision.Jan 3 2020, 8:33 AM
dfaure added inline comments.
src/kuitmarkup.cpp
91–96

Interesting idea, this simplifies the code.

However the name "wsRx" no longer matches what this regexp does.

Also, the old trimming is missing here, isn't it?
I guess \s*:?\s* is needed, and \s*/?\s*

Can you look into adding unittests for this stuff? Although I'm not sure if that's possible.

This revision now requires changes to proceed.Jan 3 2020, 8:33 AM
ahmadsamir added inline comments.Jan 3 2020, 2:04 PM
src/kuitmarkup.cpp
91–96

Ah, yes, ws is white-space, I haven't figured it out except now :), I'll change it.

The code first trimmed the string, so any leading white space is gone; the original code didn't expect any white-space inside "@role:cue/format", because the code cut the string off right before the first occurrence of \s:

context = context.mid(1, wsRx.indexIn(context) - 1);

so there can't be any white space in the context part. Also looking at the guide, https://api.kde.org/frameworks/ki18n/html/prg_guide.html#good_ctxt , that's how usage is demonstrated, e.g. "@info:whatsthis/plain".

I did look at the unit test, but i18nc() calls don't seem to fail on the context part, it just consumes it. I'll see if I can call parseUiMarker() directly and feed it some good and some bad strings.

ahmadsamir updated this revision to Diff 72709.Jan 3 2020, 7:45 PM

Use a meaningful name for the QRegularExpression object

Explain the (my) logic behind the changes in parseUiMarker() in the commit message

ahmadsamir updated this revision to Diff 72710.Jan 3 2020, 7:45 PM
ahmadsamir edited the summary of this revision. (Show Details)

Verbatim

About the unit test, I couldn't think of a way to test it, unless I use a snippet to check the regex (like you did in D26332), which doesn't seem to fit in a unit test for KLocalizedString...

See D26408 for a unittest. But maybe the code can just be killed...

See D26408 for a unittest. But maybe the code can just be killed...

My 2p, keep it as it looks like a way to override the default format(s) for whatever purposes; and it's useful to test the regex in parseUiMarker, which is used by xi18nc() which is used a lot in KDE code for translations all over the place.

dfaure added a comment.Jan 4 2020, 9:34 AM

You just made me realize something.
The setter is unused, but the getter is used in all cases, to lookup the builtin formats.

So I just extended the unittests in https://commits.kde.org/ki18n/4f9d907fb8e8dba33aa80926d3b36ba12f30da04
These tests do check the regexp in parseUiMarker, please rebase on top of them and check that they still pass.
And you're right, "@info: tooltip" (with a space in the middle) wasn't parsed as a cue, so no problem.

ahmadsamir updated this revision to Diff 72748.EditedJan 4 2020, 3:00 PM

Rebase after the unit tests commits.

All unit tests still pass.

dfaure accepted this revision.Jan 4 2020, 4:57 PM
This revision is now accepted and ready to land.Jan 4 2020, 4:57 PM
This revision was automatically updated to reflect the committed changes.
ilic added inline comments.Jan 5 2020, 5:13 PM
src/kuitmarkup.cpp
91–96

I recall I (or on someone's suggestion) didn't want to use regexes too much for performance reasons. In the original code, other than this whitespace regex, all other regexes appear in places where things went wrong and some heuristics was needed to reduce garbage in output.

I hacked up some (rather crude, to say the least) benchmarking:

void KLocalizedStringTest::benchmarkRegexSimple()
{
    QString roleName, cueName, formatName;
    QString context = QStringLiteral("@info:tooltip/plaintext");

    QBENCHMARK {
        if (context.startsWith(QLatin1Char('@'))) { // found UI marker
            static QRegularExpression wsRx(QStringLiteral("\\s"));
            context = context.mid(1, wsRx.match(context).capturedStart(0) - 1);

            // Possible format.
            int pfmt = context.indexOf(QLatin1Char('/'));
            if (pfmt >= 0) {
                formatName = context.mid(pfmt + 1);
                context.truncate(pfmt);
            }

            // Possible subcue.
            int pcue = context.indexOf(QLatin1Char(':'));
            if (pcue >= 0) {
                cueName = context.mid(pcue + 1);
                context.truncate(pcue);
            }

            // Role.
            roleName = context;
        }
    }
}

void KLocalizedStringTest::benchmarkRegex2()
{
    QString roleName, cueName, formatName;
    QString context = QStringLiteral("@info:tooltip/plaintext");

    QBENCHMARK {
        static const QRegularExpression rolesRx(QStringLiteral("^@(\\w+):?(\\w*)/?(\\w*)"));
        const QRegularExpressionMatch match = rolesRx.match(context);
        if (match.hasMatch()) {
            roleName = match.captured(1);
            cueName = match.captured(2);
            formatName = match.captured(3);
        }
    }
}

and the results were:

PASS   : KLocalizedStringTest::benchmarkRegexSimple()
RESULT : KLocalizedStringTest::benchmarkRegexSimple():
     0.0000098 msecs per iteration (total: 83, iterations: 8388608)
PASS   : KLocalizedStringTest::benchmarkRegex2()
RESULT : KLocalizedStringTest::benchmarkRegex2():
     0.00054 msecs per iteration (total: 72, iterations: 131072)

it doesn't look like a lot of difference to me; WDYT?

dfaure added a comment.Jan 6 2020, 9:11 AM

Benchmarking is complex ;)

  1. you need to make sure both Qt and your benchmark are built with optimizations enabled (-O2)
  2. you need to actually use those 3 variables otherwise the compiler might optimize out their use. This could be testing their value (NOT with Q_ASSERT if you're in release mode!) and aborting if they don't have the expected value. They will, of course, but the compiler won't know that.

Your current results indicate "54 times slower" ! But indeed on very small numbers...

This comment was removed by ahmadsamir.

Benchmarking is complex ;)

  1. you need to make sure both Qt and your benchmark are built with optimizations enabled (-O2)
  2. you need to actually use those 3 variables otherwise the compiler might optimize out their use. This could be testing their value (NOT with Q_ASSERT if you're in release mode!) and aborting if they don't have the expected value. They will, of course, but the compiler won't know that.

    Your current results indicate "54 times slower" ! But indeed on very small numbers...

Qt was built with -O2, Tumbleweed default. But my ki18n build was with -O0, changed to -O2 for this test. Added QCOMPARE() calls for all 3 vars:

PASS   : KLocalizedStringTest::benchmarkRegexSimple()
RESULT : KLocalizedStringTest::benchmarkRegexSimple():
     0.0000056 msecs per iteration (total: 94, iterations: 16777216)
PASS   : KLocalizedStringTest::benchmarkRegex2()
RESULT : KLocalizedStringTest::benchmarkRegex2():
     0.00043 msecs per iteration (total: 57, iterations: 131072)

So, the old way was 76 times faster than the new regexp :-)

I'm not surprised, though, it's consistent with my experience with regexps.

This might be a good reason to use the manual-search way. Especially now that you tested it for both performance and correctness :-)

I'm also wondering if your regexp is completely correct. :?(\w*)/?(\w*) doesn't enforce that one word must be after ':' and one word must be after '/', since the 4 things are optional in an unrelated manner. Admittedly because of \w I can't come up with a string that would be misparsed, so maybe I'm wrong about this. There are ways to express this more strictly in regexp language, but it will only make it more complicated and likely slower :-)

So, the old way was 76 times faster than the new regexp :-)

I'm not surprised, though, it's consistent with my experience with regexps.

This might be a good reason to use the manual-search way. Especially now that you tested it for both performance and correctness :-)

Less than one msec either way; so yeah, faster if one msec is even perceivable at all. :)

I'm also wondering if your regexp is completely correct. :?(\w*)/?(\w*) doesn't enforce that one word must be after ':' and one word must be after '/', since the 4 things are optional in an unrelated manner. Admittedly because of \w I can't come up with a string that would be misparsed, so maybe I'm wrong about this. There are ways to express this more strictly in regexp language, but it will only make it more complicated and likely slower :-)

Hmm, it must be one word as there are no spaces allowed, and "*" is matching greedily, it'll keep matching \w all the way if there no spaces. A problem I see is it'll match @info:/plaintext, which probably won't work as "format" probably requires both a "role" and a "cue".

What's making it easier to match with a regex here is that the input is "controlled", @foo:bar/baz, and written by developers who have to check the docs for the available values, and results are always checked by translators who'll complain if they get mal-formatted translation contexts.

Anyway, I'll revert it in a separate patch; one issue arises here though, it looks like "\s" was over-kill in the original regex, given that only a literal white-space is what's expected, 99% percent of the times, to exist in a "context", i.e. not \t, \n or \r, we can just use .indexOf(QLatin1Char(' ')). I'll create the diff and publish it for review shortly.

dfaure added a comment.Jan 7 2020, 8:06 AM

faster if one msec is even perceivable at all.

It's not, but if an application has 200 strings to translate, then there's a 0.2s difference. That starts to be perceivable.