On Thu, Feb 08, 2007 at 04:57:20PM +0100, Juerd Waalboer wrote:
> Which was part of my proposal: upgrade both the string and the pattern
> to UTF8 (if necessary), and then do naive byte matching. This should be
> explicitly enabled, because it causes havoc if you're not aware of the
> internals. Optimizations like this are very nice to have, but should
> only be used in extreme cases. Any use of such an optimization (unless
> it can safely be done automatically) is probably premature.

FYI: Juerd, I agree with you.

In glib, I have written C code that searches for characters such as '\n'
using the utf-8 functions to walk through the string. Glib does a good job
of optimizing the walk: it uses a lookup table to determine the length of
the next character in bytes, and so on. But however you work it, if you
know that the utf-8 is well formed, searching for a '\n' by searching for
the byte 0x0A is faster than walking through a variable number of bytes at
a time.

Depending on the regular expression, at least the *tests* should often be
possible to do without treating the string as utf-8. A single post-process
walk on a successful match, to ensure that pos() and $n are all correct,
might be required.

A bit of hand waving. I'm not going to open up the perl RE engine to work
on it. I think it would scare me. :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Neighbourhood Coder
Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them
  all and in the darkness bind them...

                          http://mark.mielke.cc/
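
A minimal sketch of the byte-search point made in the message above, in plain C rather than glib. It relies on the fact that in well-formed utf-8 every byte below 0x80 is a complete character, so 0x0A can never occur inside a multi-byte sequence. All function names here are hypothetical and chosen for illustration; this is neither glib's API nor anything from the perl RE engine, and the input is assumed to be well formed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the utf-8 sequence that starts with lead byte b.
 * Assumes well-formed input; unexpected bytes count as length 1
 * so that a walk always makes progress. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Character-wise search: step over one whole character at a time.
 * Returns the byte offset of the first '\n', or -1 if there is none. */
long find_newline_walk(const char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if ((unsigned char)s[i] == 0x0A) return (long)i;
        i += utf8_seq_len((unsigned char)s[i]);
    }
    return -1;
}

/* Byte-wise search: ignore the encoding and look for the byte 0x0A.
 * Returns the byte offset of the first '\n', or -1 if there is none. */
long find_newline_bytes(const char *s, size_t len)
{
    const char *p = memchr(s, 0x0A, len);
    return p ? (long)(p - s) : -1;
}

/* Post-process walk: turn a byte offset into a character offset by
 * counting the bytes before it that are not continuation bytes
 * (10xxxxxx). */
size_t utf8_char_offset(const char *s, size_t len, size_t byte_off)
{
    size_t chars = 0;
    assert(byte_off <= len);
    for (size_t i = 0; i < byte_off; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80) chars++;
    return chars;
}
```

For the 8-byte string "h\xC3\xA9llo\nw", both searches report the newline at byte offset 6, and utf8_char_offset() converts that to character offset 5. The conversion only has to run once, after a successful match, which is the single post-process walk described above.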