develooper Front page | perl.module-authors | Postings from November 2003


Andrew C. Flerchinger
November 21, 2003 02:05
I somehow missed it in the last dozen times I've searched for a similar 
module, but it looks like Sherzod Ruzmetov's Parse::Syntax is designed to 
do just what I've done. Luckily, it's listed as unimplemented, so my 
efforts aren't as wasted as they could have been. I've been calling mine 
Syntax::Highlight locally, in the spirit of Syntax::Highlight::Perl (which 
is the only module in the Syntax:: namespace). HTML::SyntaxHighlighter also 
exists (which is a horrible name), and I saw Log::Colorize discussed here 
this past June, but those were the only ones.

The module is a customizable and extensible language-neutral syntax 
highlighter. To get the syntax for a particular language, it uses grammar 
files from EditPlus as of now, but only supports a subset of its features. 
That can be changed to whatever, as I haven't solidified any license 
details on using the grammars. Parsing a grammar file is pretty trivial in 
the grand scheme of things, so a change would be pretty quick if necessary.

First, for those interested, a demonstration:
and a current code listing:

One major limitation of the method I use to parse right now is that 
delimiters can't be in keywords. That basically means highlighting markup 
languages like HTML, where <, /, and > are all delimiters as well as 
commonly part of keywords, works a little strangely. 'a' is considered a 
keyword for an anchor tag, so in <a href=... tags the 'a' is highlighted, 
but so is every bareword 'a' in the entire document, many (most) of which 
won't be anchor tags. For this reason, though I haven't checked compatibility, I think I'm 
going to see about outsourcing any HTML markup to HTML::SyntaxHighlighter. 
This brings up the subclassing issue and just using this module as a 
generic interface to language-specific syntax parsers...but if I start on 
that, this E-mail is going to be a lot longer than it needs to be right now.
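
A minimal sketch of that limitation, not the module's actual code: the delimiter set, the keyword list, and the highlight() function below are all made up for illustration. The text is split on delimiters and every resulting token is looked up in a keyword hash, so a bareword 'a' is marked exactly like the 'a' of an anchor tag.

```perl
use strict;
use warnings;

# Hypothetical grammar data: delimiter characters and a keyword set.
my $delims  = quotemeta q{<>/="};
my %keyword = map { $_ => 1 } qw(a href img);

# Split the text on delimiters and whitespace (keeping them via the
# capturing group), then wrap every token found in the keyword set.
sub highlight {
    my ($text) = @_;
    my @tokens = split /([$delims\s])/, $text;
    return join '', map { $keyword{$_} ? "[KW:$_]" : $_ } @tokens;
}

print highlight('<a href="x">a link</a>'), "\n";
```

This prints <[KW:a] [KW:href]="x">[KW:a] link</[KW:a]>, where the 'a' in "a link" is marked even though it is plain text. Because the tokenizer never sees '<a' as a single unit, it has no way to tell the two apart.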

I have two related issues. First, since it has a supporting data file 
that's required to run, how is that distributed? Does it get installed 
somewhere with the module itself and the installation can alter the 
module's code to refer to the installed location? Or do I just include the 
file in the distribution as an example input and force the user to put it 
somewhere and reference it with the module's runtime config?

Second, I've been doing a lot of benchmarking on the highlighter itself, 
but scanning and loading from a big composite grammar file to get the 
language syntax before the highlighter even starts is now the slowest part. 
35% of the total highlighting time is spent reading the grammar for a 
600-line file. Considering a module like this would be best used in 
programmers' forums, the amount of code to highlight would be significantly 
less, pushing the grammar parse percentage even higher. Since the grammars 
aren't going to change much, there's really no reason to parse them every 
time something is highlighted. In the module, each language's grammar is just a 
hash-based data structure. With all the good serializers available, they 
could just be dumped to a file with Storable at worst, or inserted into a 
BLOB-type field in a database at best.
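
A rough sketch of the Storable option, under some assumptions: load_grammar() and its arguments are hypothetical names, and $parser stands in for whatever code reference turns a grammar file into the hash structure. The cache is reused unless the grammar file has been modified more recently than the cache.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical helper: return the grammar hash for $grammar_file,
# re-parsing only when the cache is missing or older than the source.
sub load_grammar {
    my ($grammar_file, $cache_file, $parser) = @_;

    # -M gives a file's age in days, so a smaller value means newer.
    if (-e $cache_file && -M $cache_file <= -M $grammar_file) {
        return retrieve($cache_file);
    }

    my $grammar = $parser->($grammar_file);   # the slow path
    store($grammar, $cache_file);             # serialize for next time
    return $grammar;
}
```

A caller would pass its own parser, for example load_grammar('perl.stx', 'perl.stx.cache', \&parse_grammar), where both file names and parse_grammar are placeholders. Storable files written with store() are not guaranteed to be portable across machines; nstore() is the variant meant for that.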

I guess at this point I'm thinking just depend on the user to supply the 
grammar data at runtime, but give them options on how to supply it (in the 
future). I'll hold off on grammar caching for now, as I don't suspect this 
is going to be in time-sensitive places very soon. Granted, it's not slow. 
I've spent substantial time and effort benchmarking, profiling, and 
optimizing. For example, my P3-450 running FreeBSD highlights a 7125-line 
file in just over seven seconds, and my XP2000 running Win2k does it in 3.2s.

Still, I'd love to hear some ideas on how to best handle caching, target 
namespace, or any other thoughts/reactions to this grand scheme.

Thanks for reading,

PS - I'd like to thank the authors of Devel::ptkdb, Devel::Profile, and 
Benchmark::Timer for making my life easier.
