perl.perl5.porters | Postings from August 2012

[perl #89032] utf8-bracket support

From:
Linda Walsh via RT
Date:
August 27, 2012 00:53
Subject:
[perl #89032] utf8-bracket support
Message ID:
rt-3.6.HEAD-11172-1346053998-1739.89032-15-0@perl.org
On Wed Jul 18 05:01:47 2012, tom christiansen wrote:
> Linda W <perl-diddler@tlinx.org> wrote
>    on Wed, 18 Jul 2012 02:07:16 PDT: 
> 
> >> I'm afraid that you're really rather horribly confused about all this.
> >>
> >> You have managed to get yourself into a snit because you've unwittingly
> >> conflated logical code points, internal representations, and particular
> >> encodings.  Since you've gotten this wrong, nothing that follows from 
> >> your false premise is meaningful.
-----
Really?  Perhaps my perceptions are not always correct, but are you really so
sure yours always are?

> > Normally, I have PERL5OPT set to -CSA, "use utf8" in my source and a
> > UTF-8 environment, but I often don't get consistent results for chars
> > in the range  0x7f-0xff.
> 
> That's still vague.
> 
> Are you using unicode_strings in your source?  
----

I filed a bug that more clearly explains what I am seeing as a problem.

You can call it confusion, but, if such exists, it's because someone
thought the chars 128-255 could be left unencoded because they have the
same code point value in Unicode as in Latin-1.  This is my perception of
the bug in perl.  If that is incorrect, please correct me, i.e. explain
how it is wrong.

Example.

I have "use utf8" in my code and have a sub name using the script 'f':
'ƒ' (U+192).
Now you may believe I am confusing code point U+192 with the UTF-8 bytes
\xc6\x92, but they don't look a thing alike.

Now Perl, it seems, is confused: it thinks the bytes of the UTF-8
encoding of U+192 are themselves code points, even though I have -CSA
set in my PERL5OPT.

When it prints out I see: "ƒRegister_FStype"
"(U+C6)(U+92)Register_FStype"... The \xc6 and \x92 bytes that were the
UTF-8 representation of U+192 in my source were incorrectly converted by
perl into UTF-8 AGAIN, because there is a bug in how perl interprets
chars 0x80-0xff: instead of decoding the \xc6\x92 in my source correctly
as code point U+192, it treats each byte as a code point of its own
(U+C6, U+92) and encodes those as UTF-8 again, resulting in the byte
sequence \xc3\x86\xc2\x92.
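
As a rough illustration of the byte arithmetic described above, here is a
small sketch.  It is written in Python only so the bytes are easy to check;
it is not Perl's actual code path, and the variable names are made up for
this example.  It assumes the misreading amounts to treating each UTF-8
byte as a Latin-1 character before encoding again.

```python
# Sketch of the double-encoding arithmetic; not Perl's internals.
# All names here are hypothetical and exist only for this illustration.

source_char = "\u0192"  # LATIN SMALL LETTER F WITH HOOK, the script 'f'

# Step 1: the character as it sits in a UTF-8 source file.
utf8_once = source_char.encode("utf-8")

# Step 2 (the assumed mistake): each byte is read as its own character,
# which is what a Latin-1 interpretation of the bytes does.
misread = utf8_once.decode("latin-1")
misread_code_points = [f"U+{ord(ch):04X}" for ch in misread]

# Step 3: those two characters are encoded as UTF-8 a second time.
utf8_twice = misread.encode("utf-8")

print(utf8_once.hex(" "))       # c6 92
print(misread_code_points)      # ['U+00C6', 'U+0092']
print(utf8_twice.hex(" "))      # c3 86 c2 92
```

Decoding the two source bytes as UTF-8 instead of Latin-1 in step 2 would
give back the single code point U+0192, which is the behaviour I expected.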

So please tell me: who doesn't understand the difference between code
points and their encoding, me or Perl?

Is this clear enough for you?




---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=89032


