Re: use encoding 'utf8' bug for Latin-1 range

From: Juerd Waalboer
Date: February 26, 2008 12:37
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 20080226203352.GG13615@c4.convolution.nl

demerphq skribis 2008-02-26 19:11 (+0100):
> On 21/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
> >  If this backwards incompatibility is ruled unimportant, the general
> >  assumption would be: ${^ENCODING} acts on literal source code only, and
> >  the fix would be to make numeric character values always unicode
> >  codepoints. Is this correct?
> Im wondering if there isnt another option actually. We could make the
> rules for handling \x{} escapes under encoding be context sensitive.
> If such an escape is in code such that it would form an illegal utf8
> sequence then it is treated as a codepoint and not an octet. If it
> would form a valid utf8 seqence then it is treated as a octet.

This kind of fallback I would prefer to see in several different places
of Perl's unicode support. If "use utf8;" supported this (latin1
fallback for invalid sequences) then it could be made default as was the
original plan, and 99% of the latin1 scripts would continue to work
without change.
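
To illustrate the kind of fallback meant here, a minimal sketch at the
data level rather than in the parser: try strict UTF-8 first, and treat
the bytes as latin1 only if that fails. The helper name
decode_with_latin1_fallback is hypothetical, and this falls back for the
whole string rather than per sequence, so it is a simplification of the
idea, not how "use utf8;" or ${^ENCODING} actually behave.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Perl or Encode): decode as strict
# UTF-8 when the bytes are well-formed, otherwise assume latin1.
sub decode_with_latin1_fallback {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text if defined $text;
    return decode('ISO-8859-1', $bytes);
}

# "caf\xC3\xA9" is valid UTF-8; "caf\xE9" is not, so it is read as latin1.
for my $bytes ("caf\xC3\xA9", "caf\xE9") {
    my $text = decode_with_latin1_fallback($bytes);
    printf "%d characters, last is U+%04X\n",
        length $text, ord substr($text, -1);
}
```

Both inputs should come out as the same four-character text string
ending in U+00E9, which is why most latin1 scripts would keep working.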

However, continuing the \x behaviour that ${^ENCODING} currently has
(first interpret \x as bytes, then decode them) is wrong, because the
feature itself is wrong. \x should be used only for character numbers,
not for bytes that are subsequently decoded. Whether those character
numbers should always be unicode (my strong preference), be symmetrical
with ord (yes, please!), and/or use the legacy charset (no thanks), is
another discussion.

> This would probably not break anyones program and might fix a few at
> the same time.

This is true, and because of that it might be worth "fixing"
${^ENCODING} this way while it is still being deprecated.

It is not, in my opinion, a good solution for the "we should support
scripts written in any encoding" problem. That problem, if it exists,
should be addressed with a new mechanism instead of by adding even more
complexity to an existing kludge.

> Looking into this i noticed that under normal circumstances \N{U+C2}
> does not return a utf8 string, which i find quite odd.

Perl text strings are Unicode strings that may be latin1 or utf8
encoded internally. Semantics before and after utf8::upgrade must not be
different. Currently they are, and that should be considered a bug.

Having Perl use latin1 when possible is a very much desired performance
optimization.

lc, uc, lcfirst, ucfirst, //i, and character classes should be fixed to
be independent of the internal encoding.
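
A small sketch of how one might check this on a given perl. It compares
one character before and after utf8::upgrade; the variable names are
only for illustration, and what it prints depends on the perl version
and the pragmas in effect, so no particular output is claimed here.

```perl
use strict;
use warnings;

my $native   = "\xE9";       # U+00E9, held internally as one latin1 byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now held internally as UTF-8

# As text, the two strings are identical.
my $same_text = $native eq $upgraded ? "yes" : "no";

# These should agree too; where they do not, that is the bug.
my $uc_agrees = uc($native) eq uc($upgraded) ? "yes" : "no";
my $w_agrees  = (($native =~ /\w/) ? 1 : 0) == (($upgraded =~ /\w/) ? 1 : 0)
              ? "yes" : "no";

print "same text: $same_text\n";
print "uc agrees: $uc_agrees\n";
print "\\w agrees: $w_agrees\n";
```

Any "no" after the first line means an operation is looking at the
internal encoding instead of at the characters.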

> I would expect any string with an \N{} escape in it to be utf8. I
> should probably file a bug about it.

UTF8 is not Unicode. ord("\N{U+C2}") == 0xC2, exactly the unicode
codepoint that was requested.
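
A short sketch to make the distinction concrete, assuming an ASCII
platform: the character is one codepoint, and UTF-8 is just one way of
serializing it.

```perl
use strict;
use warnings;
use charnames ':full';    # only needed for \N{...} on older perls
use Encode qw(encode);

my $char  = "\N{U+C2}";               # one character, codepoint U+00C2
my $bytes = encode('UTF-8', $char);   # its UTF-8 serialization: C3 82

printf "ord: 0x%X, characters: %d, UTF-8 bytes: %d\n",
    ord $char, length $char, length $bytes;
# prints "ord: 0xC2, characters: 1, UTF-8 bytes: 2"
```

How perl happens to store $char internally does not change any of the
three numbers.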
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
