
Re: unicode support and perl

From: Simon Cozens
Date: September 15, 2000 11:25
Subject: Re: unicode support and perl
Message ID: 20000915192446.A11604@deep-dark-truthful-mirror.perlhacker.org

On Fri, Sep 15, 2000 at 02:03:09PM -0400, Spider Boardman wrote:
> The real fix for the regexp-related ones is to fix the regexp
> internal opcodes to be polymorphic with respect to utf8-ness.

For clarification, I think what we mean by polymorphic here is this:
(from perlguts)

    You may not skip over UTF8 characters in this case. If you
    do this, you'll lose the ability to match hi-bit non-UTF8 characters;
    for instance, if your UTF8 string contains C<v196.172>, and you skip
    that character, you can never match a C<chr(200)> in a non-UTF8 string.
    So don't do that!

That's to say, if you're trying to match "\xc4\xac" inside a UTF8 string, you
should *also* match "\xc8" inside a non-UTF8 string, rather than just matching
UTF8 elements only. (And vice versa - /\xc8/ should match against pack("U*",
196, 172). Or at least, I think it should. I've just realised this isn't that
clear-cut.)

The officially correct way to do this is to use utf8_to_uv on everything UTF8
as perlguts points out, but that gets expensive fast and is probably too
expensive for the regexp engine. Not sure how you want to get around that one.
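
To make the cost concrete, here is a rough sketch in C of what "decode everything
and compare code points" looks like. It is an illustration only, not the regexp
engine's code and not Perl's utf8_to_uv: the names sketch_decode_utf8,
sketch_char_eq_byte and SKETCH_BAD are made up for this example, and the byte
values used assume standard UTF-8, where code point 200 is the two bytes
0xC3 0x88. The decoder does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel for a malformed or truncated sequence. */
#define SKETCH_BAD (~0UL)

/* Hypothetical helper (NOT Perl's utf8_to_uv): decode one UTF-8 sequence
 * of 1 to 4 bytes starting at s, looking at no more than avail bytes.
 * Stores the number of bytes consumed in *used and returns the code point,
 * or SKETCH_BAD if the sequence is malformed or truncated.
 * Simplification: overlong encodings are not rejected. */
unsigned long sketch_decode_utf8(const unsigned char *s, size_t avail,
                                 size_t *used)
{
    unsigned long cp;
    size_t need, i;

    assert(s != NULL && used != NULL);
    *used = 0;
    if (avail == 0)
        return SKETCH_BAD;

    /* The lead byte tells us how long the sequence is. */
    if (s[0] < 0x80)                { cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 4; }
    else
        return SKETCH_BAD;          /* stray continuation or invalid byte */

    if (need > avail)
        return SKETCH_BAD;

    /* Each continuation byte contributes its low six bits. */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return SKETCH_BAD;
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    *used = need;
    return cp;
}

/* Compare the first character of a UTF-8 buffer against one byte taken
 * from a non-UTF-8 string, by code point rather than by raw bytes. */
int sketch_char_eq_byte(const unsigned char *u8, size_t avail,
                        unsigned char byte)
{
    size_t used;
    unsigned long cp = sketch_decode_utf8(u8, avail, &used);

    return cp != SKETCH_BAD && cp == (unsigned long)byte;
}
```

With these definitions, sketch_char_eq_byte called on the two bytes 0xC3 0x88
and the single byte 0xC8 reports a match, because both sides come out as code
point 200. The point of the sketch is the decode step: doing that for every
character the engine looks at is the expense mentioned above.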

-- 
An algorithm must be seen to be believed.
		-- D.E. Knuth
