[perl #89032] utf8-bracket support

Linda Walsh via RT
August 27, 2012 00:53
On Wed Jul 18 05:01:47 2012, tom christiansen wrote:
> Linda W <> wrote
>    on Wed, 18 Jul 2012 02:07:16 PDT: 
> >> I'm afraid that you're really rather horribly confused about all this.
> >>
> >> You have managed to get yourself into a snit because you've unwittingly
> >> conflated logical code points, internal representations, and particular
> >> encodings.  Since you've gotten this wrong, nothing that follows from 
> >> your false premise is meaningful.
Really?  Perhaps my perceptions are not always correct, but are you really
so sure yours always are?

> > Normally, I have PERL5OPT set to -CSA, "use utf8" in my source and a
> > UTF-8 environment, but I often don't get consistent results for chars
> > in the range  0x7f-0xff.
> That's still vague.
> Are you using unicode_strings in your source?  

I filed a bug that more clearly describes what I see as the problem.

You can call it confusion, but, if such exists, it's because someone
thought the chars 128-255 could be left unencoded because they have the
same code point values in Unicode as in Latin-1.  This is my perception
of the bug in perl -- if that is incorrect, please correct me -- i.e.
explain how it is wrong.
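
To make the distinction concrete, here is a minimal sketch (not code from
the bug report) that assumes only the core Encode module.  U+00E9 is an
arbitrary character picked from the range in question, and hex_bytes is a
hypothetical helper.  It shows that such a character has the same code
point in Latin-1 and Unicode but a different byte sequence once it is
encoded as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# U+00E9 is an arbitrary example from the 0x80-0xFF range.
my $char = "\x{E9}";

my $latin1 = encode('ISO-8859-1', $char);   # one byte, same value as the code point
my $utf8   = encode('UTF-8',      $char);   # two bytes

printf "code point: U+%04X\n", ord($char);
printf "Latin-1:    %s\n", hex_bytes($latin1);   # E9
printf "UTF-8:      %s\n", hex_bytes($utf8);     # C3 A9
```

So a character in this range cannot simply be passed through as a single
byte when the output is meant to be UTF-8.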


I have "use utf8" in my code and have a sub name using the script 'f':
'ƒ' (U+192).  Now you may believe I am confusing code point U+192 with
the UTF-8 bytes \xc6\x92, but they don't look a thing alike.

Now Perl -- it seems confused, as it thinks the bytes of the UTF-8
encoding of U+192 are themselves code points, even though I have -CSA
set in my PERL5OPT.

When it prints out, for "ƒRegister_FStype" I see
"(U+C6)(U+92)Register_FStype"...  The \xc6 and \xc2 bytes that were the
UTF-8 representation of U+192 in my source were incorrectly converted by
perl into UTF-8 AGAIN, because there is a bug in how perl interprets
chars 0x80-0xff -- instead of decoding the \xc6\x92 in my source
correctly as code point U+192, it incorrectly treats the two bytes as
code points U+C6 and U+92 and re-encodes them as UTF-8, resulting in the
byte sequence \xc3\x86\xc2\x92.
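
The byte sequences above can be reproduced with a short sketch.  This is
an illustration of double encoding in general, assuming only the core
Encode module; it does not claim to show where in perl the conversion
happens.  The helper hex_bytes and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $char = "\x{192}";                     # LATIN SMALL LETTER F WITH HOOK

# Correct: encode the code point once.
my $once = encode('UTF-8', $char);        # bytes C6 92

# Misreading: treat each byte as a Latin-1 code point (U+00C6, U+0092)...
my $misread = decode('ISO-8859-1', $once);
# ...and encode those two code points as UTF-8 again.
my $twice = encode('UTF-8', $misread);    # bytes C3 86 C2 92

# Correct reading: decode the original bytes as UTF-8 to get U+0192 back.
my $roundtrip = decode('UTF-8', $once);

printf "encoded once:  %s\n", hex_bytes($once);
printf "encoded twice: %s\n", hex_bytes($twice);
printf "round trip:    U+%04X\n", ord($roundtrip);
```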

So please tell me, who doesn't understand the difference between code
points and their encoding: me, or Perl?

Is this clear enough for you?

via perlbug:  queue: perl5 status: open