Small improvements in some XML files
ClosedPublic

Authored by nibags on Oct 27 2019, 10:14 AM.

Details

Summary

Changes:

  • Replace some unnecessary RegExpr rules with other rules.
  • Use non-capturing groups in some regular expressions.
  • Add ##Modelines in comments.
  • Add folding for multi-line comments, add attribute region in element <comment>.
  • Add column atributte in some rules.
Test Plan

make test

Diff Detail

Repository
R216 Syntax Highlighting
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19407
Build 19425: arc lint + arc unit
nibags created this revision.Oct 27 2019, 10:14 AM
Restricted Application added projects: Kate, Frameworks. · View Herald TranscriptOct 27 2019, 10:14 AM
Restricted Application added subscribers: kde-frameworks-devel, kwrite-devel. · View Herald Transcript
nibags requested review of this revision.Oct 27 2019, 10:14 AM
nibags updated this revision to Diff 68824.Oct 27 2019, 10:21 AM
  • Increase version of javascript.xml

I wonder if the ?: optimizations make sense. QRegularExpression has the option QRegularExpression::DontCaptureOption to not capture anything. Looking into our code we have:

561 bool RegExpr::doLoad(QXmlStreamReader& reader)
562 {
563     m_regexp.setPattern(reader.attributes().value(QStringLiteral("String")).toString());                       // here we set the pattern -> OK
564
565     const auto isMinimal = Xml::attrToBool(reader.attributes().value(QStringLiteral("minimal")));
566     const auto isCaseInsensitive = Xml::attrToBool(reader.attributes().value(QStringLiteral("insensitive")));
567     m_regexp.setPatternOptions(                                                                                // if (m_dynamic == false), we could add the
568        (isMinimal ? QRegularExpression::InvertedGreedinessOption : QRegularExpression::NoPatternOption) |      // flag QRegularExpression::DontCaptureOption
569        (isCaseInsensitive ? QRegularExpression::CaseInsensitiveOption : QRegularExpression::NoPatternOption));
570
571    // optimize the pattern for the non-dynamic case, we use them OFTEN
572    m_dynamic = Xml::attrToBool(reader.attributes().value(QStringLiteral("dynamic")));
573    if (!m_dynamic) {
574        m_regexp.optimize();
575    }
576    // [...]

In other words: The current patch adds many ?: which also makes the RegExps harder to read. So: Do we really get a performance gain here? Wouldn't it be possible to get an even better performance gain by using the flag DontCaptureOption?

Currently, I am not yet convinced, can you give this a try? :-)

I had also thought about using QRegularExpression::DontCaptureOption, which is equivalent to using (?:...), but I wasn't sure how much the real improvement in performance is. However, disabling captures avoids allocating unnecessary QString for each capture.

One option would be to add a capture or dontCapture attribute to enable or disable captures for RegExpr rules. Also, captures could be enabled or disabled in all RegExpr rules using the <general> group, adding an element for that.

[...]
One option would be to add a capture or dontCapture attribute to enable or disable captures for RegExpr rules. Also, captures could be enabled or disabled in all RegExpr rules using the <general> group, adding an element for that.

But don't we have this already with the dynamic flag? Or am I mixing things?

No, the dynamic flag is used to insert the captures already stored in %N (the captures are stored in the RegExpr rules and then "inserted" in the rules with the dynamic flag)

Hm right... too bad. I was hoping to find an automated way to detect this. Since relying on the user to optimize the RegExps will always be suboptimal. @cullmann Do you have any ideas?

Could we perhaps just do some minimal checking like "no dynamic stuff at all" => turn off all captures?

I think we should decide what to do with this patch, as over time it will get merge conflicts.

In general, I hoped we can have some sort of auto detection to disable captures in a smart way.

If this is not possible, maybe we should accept this patch?

Did you measure speed improvements?

nibags added a comment.EditedDec 2 2019, 9:08 PM

Captures can also be used without dynamic rules, for example in the regular expression: (&quot;+)[^&quot;]*\1
In these cases, a capture="true" attribute can be used to enable captures, but in dynamic rules the detection will be automatic.

I could do a benchmark of the test units with the disabled captures to check the improvement in performance, and see if these changes are worth making.

I don't know if you want to include this diff. If you prefer, I can update this diff, remove the inclusion of ?: in regex and leave only the other improvements.

nibags updated this revision to Diff 70779.Dec 2 2019, 9:10 PM
  • Resolve merge conflicts.
nibags updated this revision to Diff 70781.Dec 2 2019, 9:21 PM
  • Resolve conflicts
cullmann accepted this revision.Feb 9 2020, 11:39 AM

I think we can go with this as is.
Given we have tests for a lot of stuff, regressions shouldn't be that likely.
I will merge this, before we let that rot even more.
Thanks for the work on this, btw.!

This revision is now accepted and ready to land.Feb 9 2020, 11:39 AM
This revision was automatically updated to reflect the committed changes.