Re: use encoding 'utf8' bug for Latin-1 range

From: demerphq
Date: February 26, 2008 10:11
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 9b18b3110802261011t74a772ces10cd9ceb3ae5bd7e@mail.gmail.com

On 21/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
>  If this backwards incompatibility is ruled unimportant, the general
>  assumption would be: ${^ENCODING} acts on literal source code only, and
>  the fix would be to make numeric character values always unicode
>  codepoints. Is this correct?

I'm wondering if there isn't another option, actually. We could make the
rules for handling \x{} escapes under encoding context sensitive.
If such an escape appears where it would form an illegal utf8
sequence, it is treated as a codepoint and not an octet. If it
would form a valid utf8 sequence, it is treated as an octet.

Thus:

"A \x{FF} B"

would be treated as a codepoint and

"A \x{c2}\x{a2} B"

would be treated as the utf8 encoding for chr(0xA2).
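
A rough sketch of that rule, to make the two cases concrete. This is only
an illustration, not how the tokenizer does or would do it: the helper
name interpret_escapes is made up, it looks at one run of adjacent \x{}
escapes in isolation, and it leans on Encode's strict 'UTF-8' decoder to
decide what counts as a valid sequence.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: takes the numeric values of a run of adjacent
# \x{..} escapes and returns a label plus the resulting codepoints.
sub interpret_escapes {
    my @values = @_;

    # Anything above 0xFF cannot be an octet, so it is a codepoint.
    return ('codepoints', @values) if grep { $_ > 0xFF } @values;

    my $octets  = pack 'C*', @values;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $decoded = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };

    # Valid utf8: treat the run as octets and return what they decode to.
    return ('octets', map { ord } split //, $decoded) if defined $decoded;

    # Not valid utf8: each value stands for itself as a codepoint.
    return ('codepoints', @values);
}

my @single = interpret_escapes(0xFF);        # as in "A \x{FF} B"
my @pair   = interpret_escapes(0xC2, 0xA2);  # as in "A \x{c2}\x{a2} B"
printf "%s: %s\n", shift @$_, join ' ', map { sprintf 'U+%04X', $_ } @$_
    for [@single], [@pair];
```

A lone 0xFF can never appear in utf8, so it falls through to the codepoint
branch, while C2 A2 decodes cleanly to the single character chr(0xA2).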

This would probably not break anyone's program and might fix a few at
the same time.

And if people want the actual codepoints C2 and A2 they can use
\N{U+C2} and \N{U+A2} to obtain them*

Cheers,
yves
* Looking into this I noticed that under normal circumstances \N{U+C2}
does not return a utf8 string, which I find quite odd. I would expect
any string with an \N{} escape in it to be utf8. I should probably
file a bug about it.
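
For anyone who wants to check this on their own perl, something like the
following should do. It only prints what it finds; I am not asserting what
the flag will be, since that may differ between perl versions. It assumes
no "use encoding" is in effect.

```perl
use strict;
use warnings;

# The same codepoint written two ways.
my $named = "\N{U+C2}";
my $hex   = "\x{C2}";

# utf8::is_utf8() reports the internal UTF8 flag; it is available
# without loading the utf8 pragma.
for my $pair ([ '\N{U+C2}', $named ], [ '\x{C2}', $hex ]) {
    my ($label, $str) = @$pair;
    printf "%-9s ord=0x%X length=%d utf8 flag=%s\n",
        $label, ord($str), length($str),
        utf8::is_utf8($str) ? 'on' : 'off';
}
```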


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
