Highlighting Indexer: list of suggestions
Needs ReviewPublic

Authored by jpoelen on Feb 18 2018, 5:00 AM.

Details

Reviewers
dhaumann
Summary

Proposes mergers of rules and the replacement of:

  • RegExp by StringDetect, AnyChar, RangeDetect, etc.
  • StringDetect by DetectChar or Detect2Chars.

Diff Detail

Repository
R216 Syntax Highlighting
Branch
suggest
Lint
No Linters Available
Unit
No Unit Test Coverage
jpoelen created this revision.Feb 18 2018, 5:00 AM
Restricted Application added a project: Frameworks. · View Herald TranscriptFeb 18 2018, 5:00 AM
Restricted Application added a subscriber: Frameworks. · View Herald Transcript
jpoelen requested review of this revision.Feb 18 2018, 5:00 AM

As quick comment: I like the idea of programmatically suggesting better usage of rules. I did not yet fine the time to look into this, so I do not know how much output this currently generates.

In any case, for the highlighting files we ship with the syntax highlighting framework, I would like to have zero error/warning output generated by the highlighting indexer (just like we have zero compiler warnings). So if this generates a lot of suggestions, would you also agree to fix the output? This could possibly touch many many files and be a lot of work :-)

dhaumann requested changes to this revision.Mar 10 2018, 8:35 PM

Btw, current generated list finds 1658 issues: https://paste.kde.org/p7iareuxc

Not all of them are correct, for instance

zsh.xml" line 919 RegExpr candidate for  "Detect2Chars" : "%1"

I believe %1 in this case is a backreference in a dynamic context, so a change to Detect2Chars would be wrong.

Similarly, the suggestion

html-php.xml" line 223 RegExpr candidate for  "AnyChar" : "[^/><\"'\\s]"

is also not correct, since [^xyz] means match everything _except_ xyz, so the ^ symbol negates this, afaik (or am I remembering incorrectly?)...

Still, there are many valid items which certainly can be optimized. So patches for fixes welcome! In small chunks, if possible!

This revision now requires changes to proceed.Mar 10 2018, 8:35 PM

Just for info: https://kate-editor.org/2018/03/10/improving-syntax-highlighting-files/ tries to reach a broader audience for getting patches.

Very nice idea! RegExp rules are the by far biggest cost factor, so every single one we can get rid of is good :)

I've already planned to fix all the suggestions, but I lost a lot of time updating the sql*.xml files and came across a very strange bug related to QRegularExpression by adding some tests. At the moment, I do not know if it's my tool to test, my version of Qt that is old or a real bug. Like last week my hard drive is dead, I took the opportunity to have a more recent configuration. I will make a few patches this weekend when I finished setting my new environment.

The tool: https://github.com/jonathanpoelen/vt-kate-syntax-highlighter with the -n option.
I also have a tool to generate a graph of the rules of an xml: https://github.com/jonathanpoelen/syntax-highlighting/tree/tools (graph.lua). The output is in a format readable by graphviz. The script is rudimentary and the xml parser does not work with some files (2 or 3).

@jpoelen What would be interesting is to check which optimizations are really an improvement. Because we should either get a significant speed boost (e.g. RegExpr -> WordDetect), or at least reduce memory allocations (possibly StringDetect -> Detect2Chars and DetectChar).

Did you do some testing here ?

I just tried with a format containing 5000 <RegExpr attribute="Keyword" context="\#stay" String="\baa\b"/> rules. This gives 2000 to 6000 more allocation than with WordDetect. Same for StringDetect vs Detect2Chars.

At the speed level, for a single rule and a file with 13 * 8000 aa , WordDetect is 10% faster, but there is no difference between StringDetect and Detect2Chars. On the other hand, the more the number of rule increases the more the difference is important, even between StringDetect and Detect2Chars. 5000 times the same rule is extremely slow.

The memory test is done with XDG_DATA_DIRS=$PWD memusage kate-syntax-highlighter x.aa >/dev/null
Speed one with XDG_DATA_DIRS=$PWD /usr/bin/time --format="%Es - %MK" kate-syntax-highlighter x.aa >/dev/null

$PWD/org.kde.syntax-highlighting/syntax/aa.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE language SYSTEM "language.dtd">
<language name="AAAA" section="Configuration" extensions="*.aa" mimetype="" version="3" kateversion="2.4" author="jo" license="LGPL">
<highlighting>
<contexts>
 <context name="ini" attribute="Normal Text" lineEndContext="#stay">
   <StringDetect attribute="Keyword" context="#stay" String="aa" />
   <!-- <Detect2Chars attribute="Keyword" context="#stay" char="a" char1="a" /> -->
   <!-- <WordDetect attribute="Keyword" context="#stay" String="aa" /> -->
   <!-- <RegExpr attribute="Keyword" context="#stay" String="\baa\b" /> -->
 </context>
</contexts>
<itemDatas>
 <itemData name="Normal Text" defStyleNum="dsDataType" />
 <itemData name="Keyword" defStyleNum="dsKeyword" />
</itemDatas>
</highlighting>
</language>
jpoelen updated this revision to Diff 29869.EditedMar 18 2018, 11:20 PM

More suggestions and fixes some false positives.

StringDetect supports the dynamic attribute, normally there is no problem with "% 1".

jpoelen updated this revision to Diff 30448.Mar 24 2018, 10:54 PM

New suggestions:

  • Regex "^xyz\b" -> WordDetect with column=0
  • Regex "^xyz" with xyz a string of more than 2 characters -> StringDetext with column=0

Output:

  • Double quote are no longer escaped
  • "^\\s*" is no longer deleted

What do we do with this?

I think some of the checks are a bit too aggressive, such as folding multiple regexps into a single one: This makes the regexps much less readable and therefore hard to maintain.

@jpoelen Shall we abandon this, or do you still would like to see some parts of this committed?

Restricted Application added a project: Kate. · View Herald TranscriptJan 8 2019, 7:23 PM
Restricted Application edited subscribers, added: kde-frameworks-devel, kwrite-devel; removed: Frameworks. · View Herald Transcript

I agree that some rules are excessive or even false, but the last time I watched (it was several months ago now), some regexes suggested to be DetectString seemed to have writing errors (mainly \\ count as 2 characters in xml) and some more or less useful propositions. If there has been no change at this level, it should still be checked.

Actually, I came to the conclusion that the parser himself could make some changes on the fly. For example, DetectChar/Detect2Chars are specializations of DetectString and the concatenation of regexes could be automatic, which would make most checks obsolete. There would be only suggestions to turn a Regex into something else, but it is much easier using an AST rather than a very slobbery regexes.

Finally, this commit will never make a complete list without false positives, especially since code reviews filter out such errors. We can abandon it.

PS: this commit uses attrToBool as for the parser (if it has not changed) and adds 2 missing return false. I do not have time to take care of it at the moment.