On 2/12/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Feb 8, 2007, at 4:05 AM, demerphq wrote:
> > I welcome some bright spark recoding our unicode charclass handling to
> > use inversion lists and getting us up to full level 2 compliance for
> > unicode char class set operations.
>
> > http://macchiato.com/slides/Bits_of_Unicode.ppt
>
> A lot of the material in this presentation is covered over two
> chapters of Richard Gillam's excellent book, Unicode Demystified:
> <http://xrl.us/urko> (Link to www.amazon.com). Chapter 13
> "Techniques and Data Structures for Handling Unicode Text" and
> chapter 15 "Searching and Sorting" can basically serve as a howto for
> anyone who feels like scratching the Unicode-regex itch. Chapter 13
> has a subsection "Testing for membership in a class", which has a
> sub-subsection "Inversion lists".

Thanks for the reference. I'll have to get myself a copy. Inversion
lists as a concept seem fairly straightforward. I'm mostly curious
about how to do set operations with them, and how to construct them
efficiently.

> I wish I had time to work on this right now. My main project is a
> search engine library, and the index-time bottleneck appears to be
> Perl's UTF-8 character-class regex implementation.
>
> Mean time to index 1000 ASCII news articles
> -------------------------------------------
> UTF-8 regex tokenizer:     4.73 secs
> Latin-1 regex tokenizer:   3.04 secs
> Purpose-built C tokenizer: 1.86 secs

What version of perl are these numbers from?

> I foresee a time when it will make more sense for me to hack on the
> regex engine than on my own library. Since it seems unlikely that
> someone will address the charclass issue prior to the release of
> 5.10, would it be possible to, say, break out Perl's regex engine,
> update it, fix up a binding using the new regex engine hooks, and
> release a module to CPAN?

Yes, this is already possible with blead and will be possible in 5.10.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
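
On the inversion-list question above (construction and set operations), here is a minimal sketch in Python rather than in Perl's C internals. It assumes an inversion list is a sorted list of code points at which membership flips, starting from "not in the set". The names from_ranges, contains, combine, union, intersection and difference are made up for illustration and are not taken from Perl, from the presentation, or from Gillam's book.

```python
from bisect import bisect_right

def from_ranges(ranges):
    """Build an inversion list from inclusive (lo, hi) code point ranges."""
    inv = []
    for lo, hi in sorted(ranges):
        if inv and lo <= inv[-1]:
            # Range overlaps or touches the previous one: extend its end.
            inv[-1] = max(inv[-1], hi + 1)
        else:
            # New range: record where membership turns on and off.
            inv += [lo, hi + 1]
    return inv

def contains(inv, cp):
    """An odd count of boundaries at or below cp means cp is in the set."""
    return bisect_right(inv, cp) % 2 == 1

def combine(a, b, op):
    """Walk both lists once; op(in_a, in_b) decides membership in the result.

    Assumption: op(False, False) is False, so the result stays finite.
    """
    out, i, j = [], 0, 0
    in_a = in_b = in_out = False
    while i < len(a) or j < len(b):
        # Pick the smallest boundary not yet consumed.
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            pos = a[i]
        else:
            pos = b[j]
        if i < len(a) and a[i] == pos:
            in_a = not in_a
            i += 1
        if j < len(b) and b[j] == pos:
            in_b = not in_b
            j += 1
        now = op(in_a, in_b)
        if now != in_out:
            # Result membership flips here, so emit a boundary.
            out.append(pos)
            in_out = now
    return out

def union(a, b):
    return combine(a, b, lambda x, y: x or y)

def intersection(a, b):
    return combine(a, b, lambda x, y: x and y)

def difference(a, b):
    return combine(a, b, lambda x, y: x and not y)
```

With this layout a membership test is a binary search over the boundaries, and each set operation is a single linear merge of the two lists. Complement is not covered; in this representation it would amount to toggling a boundary at code point 0.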