perl.module-authors
From: Andrew C. Flerchinger
November 21, 2003 02:05
Message ID: email@example.com
I somehow missed it in the last dozen times I've searched for a similar
module, but it looks like Sherzod Ruzmetov's Parse::Syntax is designed to
do just what I've done. Luckily, it's listed as unimplemented, so my
efforts aren't as wasted as they could have been. I've been calling mine
Syntax::Highlight locally, in the spirit of Syntax::Highlight::Perl (which
is the only module in the Syntax:: namespace). HTML::SyntaxHighlighter also
exists (which is a horrible name), and I saw Log::Colorize discussed here
this past June, but those were the only ones.
The module is a customizable and extensible language-neutral syntax
highlighter. To get the syntax for a particular language, it uses grammar
files from EditPlus as of now, but only supports a subset of its features.
That can be changed to whatever, as I haven't solidified any license
details on using the grammars. Parsing a grammar file is pretty trivial in
the grand scheme of things, so a change would be pretty quick if necessary.
First, for those interested, a demonstration:
and a current code listing: http://lorax.no-ip.com/cgi-bin/highlight.cgi
One major limitation of the method I use to parse right now is that
delimiters can't be in keywords. That basically means highlighting markup
languages like HTML, where < / > are all delimiters as well as commonly part
of keywords, works a little strangely. 'a' is considered a keyword for an
anchor tag, so in <a href=... tags, the 'a' is highlighted, but also every
bareword 'a' in the entire document, many (most) of which won't be anchor
tags. For this reason, though I haven't checked compatibility, I think I'm
going to see about outsourcing any HTML markup to HTML::SyntaxHighlighter.
This brings up the subclassing issue and just using this module as a
generic interface to language-specific syntax parsers...but if I start on
that, this E-mail is going to be a lot longer than it needs to be right now.
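
To make the limitation concrete, here is a minimal sketch of a delimiter-based keyword scan. It is not the module's actual code: the %grammar layout, the delimiter set, and the [..] markers standing in for highlighting are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical grammar for a toy HTML syntax (not the EditPlus format).
my %grammar = (
    delimiters => [ '<', '>', '/', '=', ' ' ],
    keywords   => { a => 1, href => 1 },
);

# Split text on the grammar's delimiters, keeping each delimiter as a token.
sub tokenize {
    my ($text, $grammar) = @_;
    my $class = join '', map { quotemeta($_) } @{ $grammar->{delimiters} };
    return grep { length $_ } split /([$class])/, $text;
}

# Wrap every token that matches a keyword in [..], regardless of context.
sub highlight {
    my ($text, $grammar) = @_;
    my $out = '';
    for my $tok (tokenize($text, $grammar)) {
        $out .= $grammar->{keywords}{$tok} ? "[$tok]" : $tok;
    }
    return $out;
}

print highlight('<a href="x">a link</a>', \%grammar), "\n";
# prints: <[a] [href]="x">[a] link</[a]>
```

Because '<' is consumed as a delimiter before the keyword lookup happens, the scan cannot tell the tag name 'a' from the bareword 'a' in the link text, so both get marked.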
I have two related issues. First, since it has a supporting data file
that's required to run, how is that distributed? Does it get installed
somewhere with the module itself and the installation can alter the
module's code to refer to the installed location? Or do I just include the
file in the distribution as an example input and force the user to put it
somewhere and reference it with the module's runtime config?
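
One way to combine the two options without rewriting the module's code at install time is to look for the file at runtime. The sketch below is only an assumption about how that could work: the file name grammars.dat, the grammar_file option, and find_grammar_file() are hypothetical names, not part of the module.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical name for the bundled grammar file.
my $DEFAULT_NAME = 'grammars.dat';

# Return the first grammar file that exists: a path given in the caller's
# runtime config, then a copy installed in the same directory as this file.
# Returns nothing if neither is found.
sub find_grammar_file {
    my (%opt) = @_;
    my $beside_module = File::Spec->catfile(
        dirname(File::Spec->rel2abs(__FILE__)), $DEFAULT_NAME);
    my @candidates = grep { defined $_ } ($opt{grammar_file}, $beside_module);
    for my $path (@candidates) {
        return $path if -f $path;
    }
    return;
}
```

A caller would then write something like find_grammar_file(grammar_file => $path) and treat an empty return as "no grammar available". The explicit path wins, so users can still supply their own grammar file.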
Second, I've been doing a lot of benchmarking on the highlighter itself,
but scanning and loading from a big composite grammar file to get the
language syntax before the highlighter even starts is now the slow spot.
35% of the total highlighting time is spent reading the grammar for a 600
line file. Considering a module like this would be best used in
programmers' forums, the amount of code to highlight would be significantly
less, pushing the grammar parse percentage even higher. Since the grammars
aren't going to change much, there's really no reason to parse them every
time to highlight. In the module, each language's grammar is just a
hash-based data structure. With all the good serializers available, they
could just be dumped to a file with Storable at worst, or inserted into a
BLOB-type field in a database at best.
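
The Storable option could look roughly like the sketch below. parse_grammar_file() is a stand-in with an invented one-line-per-language format, not the real EditPlus parser, and load_grammar() and the cache path are hypothetical names. Only the nstore/retrieve calls are Storable's own.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Stand-in for the real (slow) grammar parser. The format here is invented:
# one "language: keyword keyword ..." entry per line.
sub parse_grammar_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    my %grammar;
    while (my $line = <$fh>) {
        chomp $line;
        my ($lang, $words) = split /:\s*/, $line, 2;
        next unless defined $words;
        $grammar{$lang}{keywords} = { map { $_ => 1 } split ' ', $words };
    }
    close $fh;
    return \%grammar;
}

# Use the Storable cache when it is strictly newer than the grammar file;
# otherwise parse the grammar file and rewrite the cache.
sub load_grammar {
    my ($path, $cache) = @_;
    if (-e $cache && (stat $cache)[9] > (stat $path)[9]) {
        return retrieve($cache);
    }
    my $grammar = parse_grammar_file($path);
    nstore($grammar, $cache);
    return $grammar;
}
```

Comparing modification times means an edited grammar file is picked up automatically on the next run, at the cost of two stat calls per load.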
I guess at this point I'm thinking just depend on the user to supply the
grammar data at runtime, but give them options on how to supply it (in the
future). I'll hold off on grammar caching for now, as I don't suspect this
is going to be in time-sensitive places very soon. Granted, it's not slow.
I've spent substantial time and effort benchmarking, profiling, and
optimizing. For example, my P3-450 running FreeBSD highlights the 7125 line
CGI.pm in just over seven seconds; my XP2000 running Win2k does it in 3.2s.
Still, I'd love to hear some ideas on how to best handle caching, target
namespace, or any other thoughts/reactions to this grand scheme.
Thanks for reading,
PS - I'd like to thank the authors of Devel::ptkdb, Devel::Profile, and
Benchmark::Timer for making my life easier.