On Fri, Sep 15, 2000 at 02:03:09PM -0400, Spider Boardman wrote:
> The real fix for the regexp-related ones is to fix the regexp
> internal opcodes to be polymorphic with respect to utf8-ness.

For clarification, I think what we mean by polymorphic here is this:
(from perlguts)

    You may not skip over UTF8 characters in this case. If you do this,
    you'll lose the ability to match hi-bit non-UTF8 characters; for
    instance, if your UTF8 string contains C<v196.172>, and you skip
    that character, you can never match a C<chr(200)> in a non-UTF8
    string. So don't do that!

That is to say, if you're trying to match "\xc4\xac" inside a UTF8
string, you should *also* match "\xc8" inside a non-UTF8 string, rather
than matching UTF8 elements only. (And vice versa - /\xc8/ should match
against pack("U*", 196, 172). Or at least, I think it should. I've just
realised this isn't that clear-cut.)

The officially correct way to do this is to use utf8_to_uv on everything
UTF8, as perlguts points out, but that gets expensive fast and is
probably too expensive for the regexp engine. Not sure how you want to
get around that one.

-- 
An algorithm must be seen to be believed. -- D.E. Knuth
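
For illustration only, here is a minimal C sketch of what "polymorphic with respect to utf8-ness" could look like: both strings are compared by code point, decoding the UTF8 one and reading the non-UTF8 one a byte at a time. Everything in it is an assumption rather than anything in the Perl source: decode_utf8, next_cp and cp_equal are hypothetical names, and decode_utf8 is a stand-in for a decoder such as utf8_to_uv, not its real implementation or signature. The byte values in the comments are mine, not the ones from the perlguts passage (in standard UTF-8, code point 200 is the pair 0xC3 0x88).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a decoder such as utf8_to_uv; this is NOT
 * Perl's implementation. Decodes one code point from s (avail bytes
 * remain) and stores the number of bytes consumed in *used. Malformed
 * or truncated input yields U+FFFD and consumes a single byte.
 * Overlong forms are not rejected, to keep the sketch short. */
static uint32_t decode_utf8(const unsigned char *s, size_t avail, size_t *used)
{
    uint32_t cp;
    size_t need;   /* number of continuation bytes expected */

    assert(avail > 0);

    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else                            { *used = 1; return 0xFFFD; }

    if (need >= avail) { *used = 1; return 0xFFFD; }   /* truncated */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *used = 1; return 0xFFFD; }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *used = need + 1;
    return cp;
}

/* Read the next character as a code point. A non-UTF8 string is treated
 * as one byte per character, so the byte value is the code point. */
static uint32_t next_cp(const unsigned char *s, size_t avail,
                        int is_utf8, size_t *used)
{
    if (is_utf8)
        return decode_utf8(s, avail, used);
    *used = 1;
    return s[0];
}

/* Compare two strings character by character, whatever their encodings.
 * Returns 1 if they hold the same sequence of code points, else 0. */
static int cp_equal(const unsigned char *a, size_t alen, int a_utf8,
                    const unsigned char *b, size_t blen, int b_utf8)
{
    size_t i = 0, j = 0, n;

    while (i < alen && j < blen) {
        uint32_t ca = next_cp(a + i, alen - i, a_utf8, &n); i += n;
        uint32_t cb = next_cp(b + j, blen - j, b_utf8, &n); j += n;
        if (ca != cb)
            return 0;
    }
    return i == alen && j == blen;   /* both strings fully consumed */
}
```

With this, the UTF8 bytes { 0xC3, 0x88 } flagged as UTF8 compare equal to the single non-UTF8 byte { 0xC8 }, while the same two bytes flagged as non-UTF8 do not. Note that every character of the UTF8 side goes through the decoder, which is exactly the per-character cost that worries me for the regexp engine.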