Re: use encoding 'utf8' bug for Latin-1 range

From: demerphq
Date: February 26, 2008 14:22
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 9b18b3110802261422h7a032ad9r27ccf61258a54258@mail.gmail.com

On 26/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
> demerphq skribis 2008-02-26 19:11 (+0100):
>
> > On 21/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
>  > >  If this backwards incompatibility is ruled unimportant, the general
>  > >  assumption would be: ${^ENCODING} acts on literal source code only, and
>  > >  the fix would be to make numeric character values always unicode
>  > >  codepoints. Is this correct?
>  > I'm wondering if there isn't another option actually. We could make the
>  > rules for handling \x{} escapes under encoding be context sensitive.
>  > If such an escape is in code such that it would form an illegal utf8
>  > sequence then it is treated as a codepoint and not an octet. If it
>  > would form a valid utf8 sequence then it is treated as an octet.
>
>
> This kind of fallback I would prefer to see in several different places
>  of Perl's unicode support. If "use utf8;" supported this (latin1
>  fallback for invalid sequences) then it could be made default as was the
>  original plan, and 99% of the latin1 scripts would continue to work
>  without change.
>
>  However, continuing the support for \x that ${^ENCODING} has: first
>  interpret \x as bytes, and then decode, is wrong because this feature is
>  wrong. \x should be used only for character numbers, not for
>  bytes-that-subsequently-decoded. Whether those character numbers should
>  always be unicode (my strong preference), be symmetrical with ord (yes,
>  please!), and/or using the legacy charset (no thanks), is another
>  discussion.
>
>
>  > This would probably not break anyone's program and might fix a few at
>  > the same time.
>
>
> This is true, and because of that it might be worth "fixing"
>  ${^ENCODING} this way while it is still being deprecated.
>
>  It is not, in my opinion, a good solution for the "we should support
>  scripts written in any encoding" problem. That problem, if it exists,
>  should be addressed with a new mechanism instead of by adding even more
>  complexity to an existing kludge.

I was thinking purely of \x escapes under use encoding 'utf8'.
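
The context-sensitive rule quoted above (a valid utf8 sequence is taken as octets and decoded, anything else is taken as codepoints) could be sketched roughly as follows. This is only an illustration: interpret_x_escapes is a hypothetical helper, not something in perl or encoding.pm, and it assumes that utf8::decode's notion of "well-formed" is a good enough stand-in for "would form a valid utf8 sequence".

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only, not part of perl or encoding.pm.
# $octets holds the values written as \x.. escapes in the source literal.
sub interpret_x_escapes {
    my ($octets) = @_;
    my $copy = $octets;
    # utf8::decode works in place and returns false when the octets
    # are not well-formed UTF-8; in that case fall back to treating
    # each \x value as a codepoint, which is just the original string.
    return utf8::decode($copy) ? $copy : $octets;
}

for my $src ("\xC3\xA9", "\xE9") {
    my $str = interpret_x_escapes($src);
    printf "%d escape(s) -> %d character(s), first is U+%04X\n",
        length($src), length($str), ord($str);
}
```

Both inputs end up as the single character U+00E9: the first because the two octets form a valid sequence, the second because a lone \xE9 does not and is therefore kept as a codepoint.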

>
>  > Looking into this I noticed that under normal circumstances \N{U+C2}
>  > does not return a utf8 string, which I find quite odd.
>
>
> Perl text strings are Unicode strings, that may be latin1 or utf8
>  encoded internally. Semantics before and after utf8::upgrade must not be
>  different. They are, and that should be considered a bug.
>
>  Having Perl use latin1 when possible is a very much desired performance
>  optimization.
>
>  lc, uc, lcfirst, ucfirst, //i, and character classes should be fixed to
>  be independent of the internal encoding.
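
The difference described in the quoted text can be observed with a few lines like these. It is a sketch only: it assumes no locale is in effect and deliberately does not enable the unicode_strings feature, and what it prints depends on the perl version it is run on.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no "use locale",
# so uc() is free to depend on the internal representation.
my $native   = "\xE9";      # stored as a single latin1 byte
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, utf8 internally

printf "same characters: %s\n", $native eq $upgraded ? 'yes' : 'no';
printf "uc(native)   = U+%04X\n", ord(uc($native));
printf "uc(upgraded) = U+%04X\n", ord(uc($upgraded));
```

If the two uc() lines disagree, the semantics changed across utf8::upgrade even though the two strings compare equal.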

Notice I said "a utf8 string"; I didn't say "a unicode string". I
specifically meant that \N{U+...} should always result in a
utf8-upgraded string regardless of codepoint. I realize (probably better
than many) that utf8 strings are not as efficient as latin-1 strings,
but I think any string containing an \N escape (which is documented as
unicode named sequences) should always return a utf8 string. Part of
the reason I think this is because something like \N{LATIN-SHARP-ESS}
(or whatever the hell it's called, I've had a few beers tonight, I mean
German sharp-s, YKWIM) DOES return a utf8 string despite it being in a
codepoint range where latin-1 overlaps. There is an inconsistency if
\N{U+HEX} does not return a utf8 string when the same codepoint
referred to by name does.

>  > I would expect any string with an \N{} escape in it to be utf8. I
>  > should probably file a bug about it.
>
>
> UTF8 is not Unicode. ord("\N{U+C2}") == 0xC2, exactly the unicode
>  codepoint that was requested.

As I said above, I am talking about whether the string has its utf8
bit enabled or not. I believe that any time \N is used in a string the
string should be implicitly upgraded. (There is actually core test code
that makes this assumption.)
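
For anyone who wants to check which strings carry the utf8 bit on their own perl, something like the following should do. It is a sketch: describe() is a hypothetical helper written just for this example, and the flag values it reports can differ between perl versions, so no expected output is shown.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for named \N{...} on older perls

# describe() is a hypothetical helper for this sketch only.
sub describe {
    my ($s) = @_;
    return {
        ord  => ord($s),
        utf8 => utf8::is_utf8($s) ? 'on' : 'off',
    };
}

# Three ways of writing the same codepoint, U+00DF.
my @cases = (
    [ '\N{U+DF}'                       => "\N{U+DF}" ],
    [ '\N{LATIN SMALL LETTER SHARP S}' => "\N{LATIN SMALL LETTER SHARP S}" ],
    [ '\xDF'                           => "\xDF" ],
);

for my $case (@cases) {
    my ($label, $string) = @$case;
    my $info = describe($string);
    printf "%-32s ord=0x%02X utf8 flag %s\n",
        $label, $info->{ord}, $info->{utf8};
}
```

The ord column is the same for all three; only the flag column is what is under discussion here.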

Yves




-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
