On Feb 8, 2007, at 4:05 AM, demerphq wrote:

> I welcome some bright spark recoding our unicode charclass handling to
> use inversion lists and getting us up to full level 2 compliance for
> unicode char class set operations.
>
> http://macchiato.com/slides/Bits_of_Unicode.ppt

A lot of the material in this presentation is covered over two chapters of Richard Gillam's excellent book, Unicode Demystified:

<http://xrl.us/urko> (Link to www.amazon.com)

Chapter 13, "Techniques and Data Structures for Handling Unicode Text", and chapter 15, "Searching and Sorting", can basically serve as a howto for anyone who feels like scratching the Unicode-regex itch. Chapter 13 has a subsection "Testing for membership in a class", which has a sub-subsection "Inversion lists".

I wish I had time to work on this right now. My main project is a search engine library, and the index-time bottleneck appears to be Perl's UTF-8 character-class regex implementation.

Mean time to index 1000 ASCII news articles
-------------------------------------------
UTF-8 regex tokenizer:       4.73 secs
Latin-1 regex tokenizer:     3.04 secs
Purpose-built C tokenizer:   1.86 secs

I foresee a time when it will make more sense for me to hack on the regex engine than on my own library.

Since it seems unlikely that someone will address the charclass issue prior to the release of 5.10, would it be possible to, say, break out Perl's regex engine, update it, fix up a binding using the new regex engine hooks, and release a module to CPAN?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
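
P.S. For anyone who hasn't run into inversion lists before, here is a minimal sketch in C of the membership test. This is not code from Perl's regex engine or from the book. The names `letters`, `LETTERS_LEN` and `invlist_contains` are made up for illustration, and the sample class (ASCII plus Latin-1 letters) is just an assumption to keep the table short. The list is a sorted array of code points where each entry flips membership: even slots start a range, odd slots mark the first code point past it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample class: ASCII letters plus Latin-1 letters.
 * Each pair is [start, end), so the end value itself is NOT a member. */
static const uint32_t letters[] = {
    0x41, 0x5B,   /* A-Z */
    0x61, 0x7B,   /* a-z */
    0xC0, 0xD7,   /* U+00C0..U+00D6 (skips the multiplication sign) */
    0xD8, 0xF7,   /* U+00D8..U+00F6 (skips the division sign) */
    0xF8, 0x100   /* U+00F8..U+00FF */
};
#define LETTERS_LEN (sizeof letters / sizeof letters[0])

/* Return 1 if cp is in the set described by the inversion list, else 0.
 * Binary search counts how many entries are <= cp; an odd count means
 * we are past a range start but not yet past its end. */
int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)(lo & 1);
}
```

A call such as `invlist_contains(letters, LETTERS_LEN, 0x41)` tests a single code point in O(log n) comparisons. Part of the appeal for set operations is that negating a class only requires adding or removing a leading 0 entry, and union or intersection can be done by merging two sorted lists.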