perl.perl5.porters | Postings from February 2007

Re: unicode regex performance

Marvin Humphrey
February 11, 2007 17:04

On Feb 8, 2007, at 4:05 AM, demerphq wrote:
> I welcome some bright spark recoding our unicode charclass handling to
> use inversion lists and getting us up to full level 2 compliance for
> unicode char class set operations.


A lot of the material in this presentation is covered over two
chapters of Richard Gillam's excellent book, Unicode Demystified.
Chapter 13, "Techniques and Data Structures for Handling Unicode
Text", and Chapter 15, "Searching and Sorting", can basically serve as
a howto for anyone who feels like scratching the Unicode-regex itch.
Chapter 13 has a subsection, "Testing for membership in a class",
which has a sub-subsection, "Inversion lists".
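
For anyone unfamiliar with the term, here is a minimal sketch of an inversion-list membership test. It is not taken from Gillam's book or from Perl's source; the function name invlist_contains and the array layout are assumptions made for illustration. The idea is that a character class is stored as a sorted array of code points at which membership flips: even indices open a range, odd indices close it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Perl's regex engine.
 *
 * An inversion list stores a set of code points as a sorted array of
 * boundaries. The set "turns on" at even indices and "turns off" at
 * odd indices. For example, {0x30, 0x3A, 0x41, 0x5B} is [0-9A-Z].
 *
 * Returns 1 if cp is in the set described by list[0..len-1], else 0. */
int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;    /* search window is [lo, hi) */

    assert(list != NULL || len == 0);

    /* Binary search for the number of boundaries that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means the last boundary at or below cp opened a
     * range, so cp is inside the set. */
    return (int)(lo & 1);
}
```

With the list {0x30, 0x3A, 0x41, 0x5B}, a lookup for 'A' (0x41) finds three boundaries at or below it and reports membership, while ':' (0x3A) finds two and does not. Lookup cost grows with the logarithm of the number of ranges rather than the number of code points, which is presumably why the structure is attractive for large Unicode classes.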

I wish I had time to work on this right now.  My main project is a  
search engine library, and the index-time bottleneck appears to be  
Perl's UTF-8 character-class regex implementation.

   Mean time to index 1000 ASCII news articles
   UTF-8 regex tokenizer:            4.73 secs
   Latin-1 regex tokenizer:          3.04 secs
   Purpose-built C tokenizer:        1.86 secs

I foresee a time when it will make more sense for me to hack on the  
regex engine than on my own library.  Since it seems unlikely that  
someone will address the charclass issue prior to the release of  
5.10, would it be possible to, say, break out Perl's regex engine,  
update it, fix up a binding using the new regex engine hooks, and  
release a module to CPAN?

Marvin Humphrey
Rectangular Research