Small improvements in some XML files
ClosedPublic

Authored by nibags on Oct 27 2019, 10:14 AM.

Details

Summary

Changes:

  • Replace some unnecessary RegExpr rules with other rules.
  • Use non-capturing groups in some regular expressions.
  • Add ##Modelines in comments.
  • Add folding for multi-line comments, add attribute region in element <comment>.
  • Add column atributte in some rules.
Test Plan

make test

Diff Detail

Repository
R216 Syntax Highlighting
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
nibags created this revision.Oct 27 2019, 10:14 AM
Restricted Application added projects: Kate, Frameworks. · View Herald TranscriptOct 27 2019, 10:14 AM
Restricted Application added subscribers: kde-frameworks-devel, kwrite-devel. · View Herald Transcript
nibags requested review of this revision.Oct 27 2019, 10:14 AM
nibags updated this revision to Diff 68824.Oct 27 2019, 10:21 AM
  • Increase version of javascript.xml

I wonder if the ?: optimizations make sense. QRegularExpression has the option QRegularExpression::DontCaptureOption to not capture anything. Looking into our code we have:

561 bool RegExpr::doLoad(QXmlStreamReader& reader)
562 {
563     m_regexp.setPattern(reader.attributes().value(QStringLiteral("String")).toString());                       // here we set the pattern -> OK
564
565     const auto isMinimal = Xml::attrToBool(reader.attributes().value(QStringLiteral("minimal")));
566     const auto isCaseInsensitive = Xml::attrToBool(reader.attributes().value(QStringLiteral("insensitive")));
567     m_regexp.setPatternOptions(                                                                                // if (m_dynamic == false), we could add the
568        (isMinimal ? QRegularExpression::InvertedGreedinessOption : QRegularExpression::NoPatternOption) |      // flag QRegularExpression::DontCaptureOption
569        (isCaseInsensitive ? QRegularExpression::CaseInsensitiveOption : QRegularExpression::NoPatternOption));
570
571    // optimize the pattern for the non-dynamic case, we use them OFTEN
572    m_dynamic = Xml::attrToBool(reader.attributes().value(QStringLiteral("dynamic")));
573    if (!m_dynamic) {
574        m_regexp.optimize();
575    }
576    // [...]

In other words: The current patch adds many ?: which also makes the RegExps harder to read. So: Do we really get a performance gain here? Wouldn't it be possible to get an even better performance gain by using the flag DontCaptureOption?

Currently, I am not yet convinced, can you give this a try? :-)

I had also thought about using QRegularExpression::DontCaptureOption, which is equivalent to using (?:...), but I wasn't sure how much the real improvement in performance is. However, disabling captures avoids allocating unnecessary QString for each capture.

One option would be to add a capture or dontCapture attribute to enable or disable captures for RegExpr rules. Also, captures could be enabled or disabled in all RegExpr rules using the <general> group, adding an element for that.

[...]
One option would be to add a capture or dontCapture attribute to enable or disable captures for RegExpr rules. Also, captures could be enabled or disabled in all RegExpr rules using the <general> group, adding an element for that.

But don't we have this already with the dynamic flag? Or am I mixing things?

No, the dynamic flag is used to insert the captures already stored in %N (the captures are stored in the RegExpr rules and then "inserted" in the rules with the dynamic flag)

Hm right... too bad. I was hoping to find an automated way to detect this. Since relying on the user to optimize the RegExps will always be suboptimal. @cullmann Do you have any ideas?

Could we perhaps just do some minimal checking like "no dynamic stuff at all" => turn off all captures?

I think we should decide what to do with this patch, as over time it will get merge conflicts.

In general, I hoped we can have some sort of auto detection to disable captures in a smart way.

If this is not possible, maybe we should accept this patch?

Did you measure speed improvements?

nibags added a comment.EditedDec 2 2019, 9:08 PM

Captures can also be used without dynamic rules, for example in the regular expression: (&quot;+)[^&quot;]*\1
In these cases, a capture="true" attribute can be used to enable captures, but in dynamic rules the detection will be automatic.

I could do a benchmark of the test units with the disabled captures to check the improvement in performance, and see if these changes are worth making.

I don't know if you want to include this diff. If you prefer, I can update this diff, remove the inclusion of ?: in regex and leave only the other improvements.

nibags updated this revision to Diff 70779.Dec 2 2019, 9:10 PM
  • Resolve merge conflicts.
nibags updated this revision to Diff 70781.Dec 2 2019, 9:21 PM
  • Resolve conflicts
cullmann accepted this revision.Feb 9 2020, 11:39 AM

I think we can go with this as is.
Given we have tests for a lot of stuff, regressions shouldn't be that likely.
I will merge this, before we let that rot even more.
Thanks for the work on this, btw.!

This revision is now accepted and ready to land.Feb 9 2020, 11:39 AM
This revision was automatically updated to reflect the committed changes.