Details

Reviewers

ilic
dfaure
mlaurent
aacid

Group Reviewers

Frameworks

Commits

R249:a9f4cbdda790: [Kuit] Port QRegExp to QRegularExpression, third pass

Summary

Port determineIsStructured() to QRegularExpression.

Port parseUiMarker() to QRegularExpression:

call toLower() once on the context_ parameter, that should make all the strings lower case in one go
remove redundant trimmed() calls, the original code (before this patch) chopped the context_ string right before the first whitespace "\s", i.e. the code expected that there are no whitespaces inside "@role:cue/format"

Test Plan

make && ctest

Diff Detail

Repository

R249 KI18n

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

ahmadsamir created this revision.Jan 2 2020, 2:09 PM

Restricted Application added a project: Frameworks. · View Herald TranscriptJan 2 2020, 2:09 PM

ahmadsamir requested review of this revision.Jan 2 2020, 2:09 PM

Harbormaster completed remote builds in B20535: Diff 72616.Jan 2 2020, 2:09 PM

dfaure requested changes to this revision.Jan 3 2020, 8:33 AM

dfaure added inline comments.

src/kuitmarkup.cpp
91–96	Interesting idea, this simplifies the code. However the name "wsRx" no longer matches what this regexp does. Also, the old trimming is missing here, isn't it? I guess \s:?\s is needed, and \s/?\s Can you look into adding unittests for this stuff? Although I'm not sure if that's possible.

This revision now requires changes to proceed.Jan 3 2020, 8:33 AM

ahmadsamir added inline comments.Jan 3 2020, 2:04 PM

src/kuitmarkup.cpp
91–96	Ah, yes, ws is white-space, I haven't figured it out except now :), I'll change it. The code first trimmed the string, so any leading white space is gone; the original code didn't expect any white-space inside "@role:cue/format", because the code cut the string off right before the first occurrence of \s: context = context.mid(1, wsRx.indexIn(context) - 1); so there can't be any white space in the context part. Also looking at the guide, https://api.kde.org/frameworks/ki18n/html/prg_guide.html#good_ctxt , that's how usage is demonstrated, e.g. "@info:whatsthis/plain". I did look at the unit test, but i18nc() calls don't seem to fail on the context part, it just consumes it. I'll see if I can call parseUiMarker() directly and feed it some good and some bad strings.

Use a meaningful name for the QRegularExpression object

Explain the (my) logic behind the changes in parseUiMarker() in the commit message

Harbormaster completed remote builds in B20591: Diff 72709.Jan 3 2020, 7:45 PM

Verbatim

Harbormaster completed remote builds in B20592: Diff 72710.Jan 3 2020, 7:45 PM

About the unit test, I couldn't think of a way to test it, unless I use a snippet to check the regex (like you did in D26332), which doesn't seem to fit in a unit test for KLocalizedString...

dfaure mentioned this in D26408: ki18n: export Kuit::setupForDomain, unittest KUitSetup::setFormatForMarker.Jan 4 2020, 12:39 AM

See D26408 for a unittest. But maybe the code can just be killed...

In D26366#587313, @dfaure wrote:

See D26408 for a unittest. But maybe the code can just be killed...

My 2p, keep it as it looks like a way to override the default format(s) for whatever purposes; and it's useful to test the regex in parseUiMarker, which is used by xi18nc() which is used a lot in KDE code for translations all over the place.

You just made me realize something.
The setter is unused, but the getter is used in all cases, to lookup the builtin formats.

So I just extended the unittests in https://commits.kde.org/ki18n/4f9d907fb8e8dba33aa80926d3b36ba12f30da04
These tests do check the regexp in parseUiMarker, please rebase on top of them and check that they still pass.
And you're right, "@info: tooltip" (with a space in the middle) wasn't parsed as a cue, so no problem.

Rebase after the unit tests commits.

All unit tests still pass.

Harbormaster completed remote builds in B20619: Diff 72748.Jan 4 2020, 3:00 PM

dfaure accepted this revision.Jan 4 2020, 4:57 PM

This revision is now accepted and ready to land.Jan 4 2020, 4:57 PM

Closed by commit R249:a9f4cbdda790: [Kuit] Port QRegExp to QRegularExpression, third pass (authored by ahmadsamir). · Explain WhyJan 4 2020, 6:32 PM

This revision was automatically updated to reflect the committed changes.

ilic added inline comments.Jan 5 2020, 5:13 PM

src/kuitmarkup.cpp
91–96	I recall I (or on someone's suggestion) didn't want to use regexes too much for performance reasons. In the original code, other than this whitespace regex, all other regexes appear in places where things went wrong and some heuristics was needed to reduce garbage in output.

dfaure mentioned this in R249:836908f3b83e: ki18n: export Kuit::setupForDomain, unittest KUitSetup::setFormatForMarker.Jan 5 2020, 5:29 PM

I hacked up some (rather crude, to say the least) benchmarking:

void KLocalizedStringTest::benchmarkRegexSimple()
{
    QString roleName, cueName, formatName;
    QString context = QStringLiteral("@info:tooltip/plaintext");

    QBENCHMARK {
        if (context.startsWith(QLatin1Char('@'))) { // found UI marker
            static QRegularExpression wsRx(QStringLiteral("\\s"));
            context = context.mid(1, wsRx.match(context).capturedStart(0) - 1);

            // Possible format.
            int pfmt = context.indexOf(QLatin1Char('/'));
            if (pfmt >= 0) {
                formatName = context.mid(pfmt + 1);
                context.truncate(pfmt);
            }

            // Possible subcue.
            int pcue = context.indexOf(QLatin1Char(':'));
            if (pcue >= 0) {
                cueName = context.mid(pcue + 1);
                context.truncate(pcue);
            }

            // Role.
            roleName = context;
        }
    }
}

void KLocalizedStringTest::benchmarkRegex2()
{
    QString roleName, cueName, formatName;
    QString context = QStringLiteral("@info:tooltip/plaintext");

    QBENCHMARK {
        static const QRegularExpression rolesRx(QStringLiteral("^@(\\w+):?(\\w*)/?(\\w*)"));
        const QRegularExpressionMatch match = rolesRx.match(context);
        if (match.hasMatch()) {
            roleName = match.captured(1);
            cueName = match.captured(2);
            formatName = match.captured(3);
        }
    }
}

and the results were:

PASS   : KLocalizedStringTest::benchmarkRegexSimple()
RESULT : KLocalizedStringTest::benchmarkRegexSimple():
     0.0000098 msecs per iteration (total: 83, iterations: 8388608)
PASS   : KLocalizedStringTest::benchmarkRegex2()
RESULT : KLocalizedStringTest::benchmarkRegex2():
     0.00054 msecs per iteration (total: 72, iterations: 131072)

it doesn't look like a lot of difference to me; WDYT?

Benchmarking is complex ;)

you need to make sure both Qt and your benchmark are built with optimizations enabled (-O2)
you need to actually use those 3 variables otherwise the compiler might optimize out their use. This could be testing their value (NOT with Q_ASSERT if you're in release mode!) and aborting if they don't have the expected value. They will, of course, but the compiler won't know that.

Your current results indicate "54 times slower" ! But indeed on very small numbers...

ahmadsamir added a comment.Jan 6 2020, 9:57 AM

This comment was removed by ahmadsamir.

In D26366#588416, @dfaure wrote:

Benchmarking is complex ;)

you need to make sure both Qt and your benchmark are built with optimizations enabled (-O2)

you need to actually use those 3 variables otherwise the compiler might optimize out their use. This could be testing their value (NOT with Q_ASSERT if you're in release mode!) and aborting if they don't have the expected value. They will, of course, but the compiler won't know that.

Your current results indicate "54 times slower" ! But indeed on very small numbers...

Qt was built with -O2, Tumbleweed default. But my ki18n build was with -O0, changed to -O2 for this test. Added QCOMPARE() calls for all 3 vars:

PASS   : KLocalizedStringTest::benchmarkRegexSimple()
RESULT : KLocalizedStringTest::benchmarkRegexSimple():
     0.0000056 msecs per iteration (total: 94, iterations: 16777216)
PASS   : KLocalizedStringTest::benchmarkRegex2()
RESULT : KLocalizedStringTest::benchmarkRegex2():
     0.00043 msecs per iteration (total: 57, iterations: 131072)

So, the old way was 76 times faster than the new regexp :-)

I'm not surprised, though, it's consistent with my experience with regexps.

This might be a good reason to use the manual-search way. Especially now that you tested it for both performance and correctness :-)

I'm also wondering if your regexp is completely correct. :?(\w*)/?(\w*) doesn't enforce that one word must be after ':' and one word must be after '/', since the 4 things are optional in an unrelated manner. Admittedly because of \w I can't come up with a string that would be misparsed, so maybe I'm wrong about this. There are ways to express this more strictly in regexp language, but it will only make it more complicated and likely slower :-)

In D26366#589229, @dfaure wrote:

So, the old way was 76 times faster than the new regexp :-)

I'm not surprised, though, it's consistent with my experience with regexps.

This might be a good reason to use the manual-search way. Especially now that you tested it for both performance and correctness :-)

Less than one msec either way; so yeah, faster if one msec is even perceivable at all. :)

I'm also wondering if your regexp is completely correct. :?(\w*)/?(\w*) doesn't enforce that one word must be after ':' and one word must be after '/', since the 4 things are optional in an unrelated manner. Admittedly because of \w I can't come up with a string that would be misparsed, so maybe I'm wrong about this. There are ways to express this more strictly in regexp language, but it will only make it more complicated and likely slower :-)

Hmm, it must be one word as there are no spaces allowed, and "*" is matching greedily, it'll keep matching \w all the way if there no spaces. A problem I see is it'll match @info:/plaintext, which probably won't work as "format" probably requires both a "role" and a "cue".

What's making it easier to match with a regex here is that the input is "controlled", @foo:bar/baz, and written by developers who have to check the docs for the available values, and results are always checked by translators who'll complain if they get mal-formatted translation contexts.

Anyway, I'll revert it in a separate patch; one issue arises here though, it looks like "\s" was over-kill in the original regex, given that only a literal white-space is what's expected, 99% percent of the times, to exist in a "context", i.e. not \t, \n or \r, we can just use .indexOf(QLatin1Char(' ')). I'll create the diff and publish it for review shortly.

faster if one msec is even perceivable at all.

It's not, but if an application has 200 strings to translate, then there's a 0.2s difference. That starts to be perceivable.

ahmadsamir mentioned this in D26487: [Kuit] Revert changes in parseUiMarker().Jan 7 2020, 9:32 AM

ahmadsamir mentioned this in R249:fff18126b973: [Kuit] Revert changes in parseUiMarker().Jan 11 2020, 9:29 AM

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	72616	f9b932d		Jan 2 2020, 2:09 PM	★	★
Diff 2	72709	f9b932d	Use a meaningful name for the QRegularExpression object	Jan 3 2020, 7:45 PM	★	★
Diff 3	72710	f9b932d	Verbatim	Jan 3 2020, 7:45 PM	★	★
Diff 4	72748	a92305a	Rebase after the unit tests commits	Jan 4 2020, 3:00 PM	★	★
Diff 5	72763	a92305a	R249:a9f4cbdda790c3962635605cee69be882f209e25	Jan 4 2020, 6:32 PM	★	★

[Kuit] Port QRegExp to QRegularExpression, third pass
ClosedPublic
Actions

Details

Diff Detail

Revision Contents
Changeset List

Diff 72763

src/kuitmarkup.cpp

[Kuit] Port QRegExp to QRegularExpression, third passClosedPublicActions

Details

Diff Detail

Revision ContentsChangeset List

Diff 72763

src/kuitmarkup.cpp

[Kuit] Port QRegExp to QRegularExpression, third pass
ClosedPublic
Actions

Revision Contents
Changeset List