As a quick comment: I like the idea of programmatically suggesting better usage of rules. I have not yet found the time to look into this, so I do not know how much output this currently generates.

In any case, for the highlighting files we ship with the syntax highlighting framework, I would like to have zero error/warning output generated by the highlighting indexer (just like we have zero compiler warnings). So if this generates a lot of suggestions, would you also agree to fix the output? This could possibly touch many many files and be a lot of work :-)

Btw, the currently generated list contains 1658 issues: https://paste.kde.org/p7iareuxc

Not all of them are correct, for instance

zsh.xml" line 919 RegExpr candidate for  "Detect2Chars" : "%1"

I believe %1 in this case is a backreference in a dynamic context, so a change to Detect2Chars would be wrong.

Similarly, the suggestion

html-php.xml" line 223 RegExpr candidate for  "AnyChar" : "[^/><\"'\\s]"

is also not correct, since [^xyz] means match everything _except_ xyz, so the ^ symbol negates this, afaik (or am I remembering incorrectly?)...
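To illustrate the point about negation, here is a minimal sketch of a stricter check. It is not the code from katehighlightingindexer.cpp; the helper name `anychar_candidate` and the deliberately narrow rule (only plain, positive character classes without ranges or escapes qualify) are assumptions made for this example.

```python
import re

# Hypothetical helper, not taken from the indexer.
# Accepts only "[...]" whose content has no leading '^' (negation),
# no '-' (possible range) and no backslash (escape or class such as \s).
_SIMPLE_CLASS = re.compile(r"\[([^\]\\^-][^\]\\-]*)\]")


def anychar_candidate(pattern):
    """Return the character set if `pattern` is a plain positive class, else None."""
    m = _SIMPLE_CLASS.fullmatch(pattern)
    return m.group(1) if m else None


# A positive class could become an AnyChar rule with String="abc".
print(anychar_candidate("[abc]"))
# The html-php.xml pattern is negated and contains \s, so it is rejected.
print(anychar_candidate("[^/><\"'\\s]"))
```

With such a filter the html-php.xml line above would not be reported, at the cost of missing some valid candidates that use escapes.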

Still, there are many valid items which certainly can be optimized. So patches for fixes welcome! In small chunks, if possible!

This revision now requires changes to proceed. Mar 10 2018, 8:35 PM

Just for info: https://kate-editor.org/2018/03/10/improving-syntax-highlighting-files/ tries to reach a broader audience for getting patches.

Very nice idea! RegExpr rules are by far the biggest cost factor, so every single one we can get rid of is good :)

I had already planned to fix all the suggestions, but I lost a lot of time updating the sql*.xml files and, while adding some tests, came across a very strange bug related to QRegularExpression. At the moment, I do not know whether the cause is my test tool, my old version of Qt, or a real bug. Since my hard drive died last week, I took the opportunity to set up a more recent configuration. I will make a few patches this weekend once I have finished setting up my new environment.

The tool: https://github.com/jonathanpoelen/vt-kate-syntax-highlighter with the -n option.
I also have a tool to generate a graph of the rules of an XML file: https://github.com/jonathanpoelen/syntax-highlighting/tree/tools (graph.lua). The output is in a format readable by Graphviz. The script is rudimentary and its XML parser does not work with some files (2 or 3).

dhaumann mentioned this in D11298: Optimize highlighting Bash, Cisco, Clipper, Coffee, Gap, Haml, Haskell. Mar 13 2018, 8:40 PM

@jpoelen What would be interesting is to check which optimizations are really an improvement. Because we should either get a significant speed boost (e.g. RegExpr -> WordDetect), or at least reduce memory allocations (possibly StringDetect -> Detect2Chars and DetectChar).

Did you do some testing here?

dhaumann mentioned this in R216:cec763566ca4: Optimize highlighting Bash, Cisco, Clipper, Coffee, Gap, Haml, Haskell. Mar 13 2018, 8:44 PM

I just tried with a format containing 5000 <RegExpr attribute="Keyword" context="#stay" String="\baa\b"/> rules. This gives 2000 to 6000 more allocations than with WordDetect. The same holds for StringDetect vs Detect2Chars.

Regarding speed, for a single rule and a file containing 13 * 8000 occurrences of aa, WordDetect is 10% faster than RegExpr, but there is no difference between StringDetect and Detect2Chars. On the other hand, the more rules there are, the larger the difference becomes, even between StringDetect and Detect2Chars. Repeating the same rule 5000 times is extremely slow.

The memory test is done with XDG_DATA_DIRS=$PWD memusage kate-syntax-highlighter x.aa >/dev/null
The speed test is done with XDG_DATA_DIRS=$PWD /usr/bin/time --format="%Es - %MK" kate-syntax-highlighter x.aa >/dev/null

$PWD/org.kde.syntax-highlighting/syntax/aa.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE language SYSTEM "language.dtd">
<language name="AAAA" section="Configuration" extensions="*.aa" mimetype="" version="3" kateversion="2.4" author="jo" license="LGPL">
<highlighting>
<contexts>
 <context name="ini" attribute="Normal Text" lineEndContext="#stay">
   <StringDetect attribute="Keyword" context="#stay" String="aa" />
   <!-- <Detect2Chars attribute="Keyword" context="#stay" char="a" char1="a" /> -->
   <!-- <WordDetect attribute="Keyword" context="#stay" String="aa" /> -->
   <!-- <RegExpr attribute="Keyword" context="#stay" String="\baa\b" /> -->
 </context>
</contexts>
<itemDatas>
 <itemData name="Normal Text" defStyleNum="dsDataType" />
 <itemData name="Keyword" defStyleNum="dsKeyword" />
</itemDatas>
</highlighting>
</language>
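For anyone who wants to reproduce a test of this kind, the two input files could be generated along the following lines. This is only a sketch: the helper names `make_definition` and `make_input` are hypothetical, and the layout of x.aa (8000 lines of 13 space-separated words) is an assumption, since the comment above only states "13 * 8000 aa".

```python
# Sketch only: helper names and the x.aa layout are assumptions,
# not the original test setup.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE language SYSTEM "language.dtd">
<language name="AAAA" section="Configuration" extensions="*.aa" mimetype="" version="3" kateversion="2.4" author="jo" license="LGPL">
<highlighting>
<contexts>
 <context name="ini" attribute="Normal Text" lineEndContext="#stay">
{rules}
 </context>
</contexts>
<itemDatas>
 <itemData name="Normal Text" defStyleNum="dsDataType" />
 <itemData name="Keyword" defStyleNum="dsKeyword" />
</itemDatas>
</highlighting>
</language>
"""


def make_definition(rule, count):
    """Return the aa.xml text with `rule` repeated `count` times in the context."""
    rules = "\n".join("   " + rule for _ in range(count))
    return TEMPLATE.format(rules=rules)


def make_input(word="aa", words_per_line=13, lines=8000):
    """Return the text of x.aa: `lines` lines of `words_per_line` words each."""
    line = " ".join([word] * words_per_line)
    return "\n".join([line] * lines) + "\n"
```

Writing the result of `make_definition('<WordDetect attribute="Keyword" context="#stay" String="aa" />', 5000)` to org.kde.syntax-highlighting/syntax/aa.xml and the result of `make_input()` to x.aa would give inputs for the memusage and /usr/bin/time commands quoted above.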

More suggestions and fixes some false positives.

StringDetect supports the dynamic attribute, so normally there is no problem with "%1".

nibags mentioned this in D11543: Optimize many syntax highlighting files and fix the '/' char of SQL. Mar 21 2018, 10:07 AM

New suggestions:

Regex "^xyz\b" -> WordDetect with column=0
Regex "^xyz" with xyz a string of more than 2 characters -> StringDetect with column=0
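
The two suggestions above could be sketched as follows. This is not the indexer's implementation: the function name `suggest_anchored_rule` is hypothetical, and the sketch deliberately accepts only literals made of plain word characters, so anything containing regex metacharacters is left alone.

```python
import re

# Hypothetical sketch, not the code from katehighlightingindexer.cpp.
# "^" + plain word characters + optional literal "\b" at the end.
_ANCHORED = re.compile(r"\^([A-Za-z0-9_]+)(\\b)?")


def suggest_anchored_rule(pattern):
    """Return (rule name, attributes) for an anchored literal regex, else None."""
    m = _ANCHORED.fullmatch(pattern)
    if not m:
        return None
    literal, word_boundary = m.group(1), m.group(2)
    if word_boundary:
        # "^xyz\b" -> WordDetect with column=0
        return ("WordDetect", {"String": literal, "column": 0})
    if len(literal) > 2:
        # "^xyz" with more than 2 characters -> StringDetect with column=0
        return ("StringDetect", {"String": literal, "column": 0})
    return None


print(suggest_anchored_rule("^xyz\\b"))
print(suggest_anchored_rule("^xyz"))
print(suggest_anchored_rule("^x.z"))
```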

Output:

Double quotes are no longer escaped
"^\\s*" is no longer deleted

nibags mentioned this in D11945: Improve highlighting of SELinux CIL policies & file contexts. Apr 5 2018, 7:21 AM

cullmann mentioned this in R216:43396e0a9773: Optimize many syntax highlighting files and fix the '/' char of SQL. Aug 14 2018, 3:07 PM

What do we do with this?

I think some of the checks are a bit too aggressive, such as folding multiple regexps into a single one: This makes the regexps much less readable and therefore hard to maintain.

@jpoelen Shall we abandon this, or would you still like to see some parts of this committed?

I agree that some rules are excessive or even wrong, but the last time I looked (several months ago now), some of the regexes suggested for conversion to StringDetect seemed to contain mistakes (mainly because \\ counts as 2 characters in XML), alongside some more or less useful suggestions. If nothing has changed there, this should still be checked.

Actually, I came to the conclusion that the parser itself could make some of these changes on the fly. For example, DetectChar/Detect2Chars are specializations of StringDetect, and the concatenation of regexes could be automatic, which would make most checks obsolete. Only the suggestions to turn a RegExpr into something else would remain, and those are much easier to produce with an AST than with very sloppy regexes.
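
As a rough illustration of the specialization idea, a loader could pick the narrowest rule for a literal string by its length. This is a hypothetical sketch and not how the parser currently behaves; the function name is invented, and attributes such as dynamic or insensitive are ignored here. The length must be that of the decoded string, not of its escaped form in the XML file.

```python
# Hypothetical sketch: choose the most specific rule for a literal string.
# Not the behaviour of the actual parser; dynamic/insensitive are ignored.
def specialize_string_detect(text):
    """Map a decoded literal to (rule name, attributes) by its length."""
    if not text:
        raise ValueError("a StringDetect rule needs a non-empty String")
    if len(text) == 1:
        return ("DetectChar", {"char": text})
    if len(text) == 2:
        return ("Detect2Chars", {"char": text[0], "char1": text[1]})
    return ("StringDetect", {"String": text})


print(specialize_string_detect("a"))
print(specialize_string_detect("aa"))
print(specialize_string_detect("aaa"))
```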

Finally, this commit will never produce a complete list free of false positives, and code reviews filter out such errors anyway. We can abandon it.

PS: this commit uses attrToBool in the same way as the parser (if that has not changed) and adds 2 missing "return false" statements. I do not have time to take care of it at the moment.

Revision Contents
Changeset List

			Path	Packages
M			src/indexer/katehighlightingindexer.cpp (267 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	27439	0f75a36		Feb 18 2018, 5:00 AM	★	★
Diff 2	29869	0f75a36	More suggestions and fixes some false positives.	Mar 18 2018, 11:20 PM	★	★
Diff 3	30448	0f75a36	New suggestions:	Mar 24 2018, 10:54 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
164d4adeba07	d10fdbd3e567	cc9cfb00a1cd	jonathanpoelen	New suggestions: - Regex "^xyz\b" -> WordDetect with column=0 - Regex "^xyz"…	Mar 24 2018, 10:40 PM
cc9cfb00a1cd	c3cac5637dec	0f75a3659b3c	jonathan poelen	Highlighting Indexer: list of suggestions	Feb 17 2018, 10:07 PM
