Re: unicode regex performance

From: demerphq
Date: February 12, 2007 02:28
Subject: Re: unicode regex performance
Message ID: 9b18b3110702120228n7554589eg1106edfafc3c0c5e@mail.gmail.com

On 2/12/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Feb 8, 2007, at 4:05 AM, demerphq wrote:
> > I welcome some bright spark recoding our unicode charclass handling to
> > use inversion lists and getting us up to full level 2 compliance for
> > unicode char class set operations.
>
> > http://macchiato.com/slides/Bits_of_Unicode.ppt
>
> A lot of the material in this presentation is covered over two
> chapters of Richard Gillam's excellent book, Unicode Demystified:
> <http://xrl.us/urko> (Link to www.amazon.com).  Chapter 13
> "Techniques and Data Structures for Handling Unicode Text" and
> chapter 15 "Searching and Sorting" can basically serve as a howto for
> anyone who feels like scratching the Unicode-regex itch.  Chapter 13
> has a subsection "Testing for membership in a class", which has a sub-
> subsection "Inversion lists".

Thanks for the reference. I'll have to get myself a copy.

Inversion lists as a concept seem fairly straightforward; I'm mostly
curious about set operations using them, and how to construct them
efficiently.
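
For readers unfamiliar with the idea, here is a minimal sketch of an
inversion list in Python. It is not how Perl's regex engine represents
character classes; the function names (from_ranges, contains, combine)
are made up for illustration. It assumes the usual definition: a sorted
list of code points at which membership flips, so [0x41, 0x5B] means
"in the set from 0x41 up to but not including 0x5B".

```python
from bisect import bisect_right

def from_ranges(ranges):
    """Build an inversion list from inclusive (lo, hi) code point ranges."""
    inv = []
    for lo, hi in sorted(ranges):
        if inv and lo <= inv[-1]:
            # Overlaps or abuts the previous range: extend it.
            inv[-1] = max(inv[-1], hi + 1)
        else:
            inv.extend((lo, hi + 1))
    return inv

def contains(inv, cp):
    """Binary search: an odd number of boundaries at or below cp means 'in'."""
    return bisect_right(inv, cp) % 2 == 1

def combine(a, b, op):
    """Merge two inversion lists in one linear pass.

    op(in_a, in_b) decides membership in the result; it must return
    False when both arguments are False (true of union, intersection
    and difference, but not of complement).
    """
    out = []
    i = j = 0
    in_a = in_b = state = False
    while i < len(a) or j < len(b):
        # Pick the smallest boundary not yet consumed.
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            point = a[i]
        else:
            point = b[j]
        # Toggle whichever input(s) have a boundary at this point.
        if i < len(a) and a[i] == point:
            in_a = not in_a
            i += 1
        if j < len(b) and b[j] == point:
            in_b = not in_b
            j += 1
        # Emit a boundary only when the combined membership flips.
        new_state = op(in_a, in_b)
        if new_state != state:
            out.append(point)
            state = new_state
    return out

def union(a, b):
    return combine(a, b, lambda x, y: x or y)

def intersection(a, b):
    return combine(a, b, lambda x, y: x and y)

def difference(a, b):
    return combine(a, b, lambda x, y: x and not y)

# ASCII letters, and the characters that can appear in a hex number.
letters = union(from_ranges([(0x41, 0x5A)]), from_ranges([(0x61, 0x7A)]))
hexdigits = from_ranges([(0x30, 0x39), (0x41, 0x46), (0x61, 0x66)])
hex_letters = intersection(letters, hexdigits)
```

In this sketch construction costs a sort of the input ranges, a
membership test is one binary search, and each set operation is a
single pass over both lists, so the cost depends on the number of
ranges rather than the number of code points.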

> I wish I had time to work on this right now.  My main project is a
> search engine library, and the index-time bottleneck appears to be
> Perl's UTF-8 character-class regex implementation.
>
>    Mean time to index 1000 ASCII news articles
>    -------------------------------------------
>    UTF-8 regex tokenizer:            4.73 secs
>    Latin-1 regex tokenizer:          3.04 secs
>    Purpose-built C tokenizer:        1.86 secs

What version of perl are these numbers from?

> I foresee a time when it will make more sense for me to hack on the
> regex engine than on my own library.  Since it seems unlikely that
> someone will address the charclass issue prior to the release of
> 5.10, would it be possible to, say, break out Perl's regex engine,
> update it, fix up a binding using the new regex engine hooks, and
> release a module to CPAN?

Yes, this is already possible with blead and will be possible in 5.10.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


