perl.perl5.porters | Postings from April 2011

The Unicode Bug still bites?

From: Tom Christiansen
Date: April 13, 2011 12:02
Subject: The Unicode Bug still bites?
Message ID: 7729.1302721341@chthon

I started off trying to fix this paragraph:

    In C<quotemeta> or its inline equivalent C<\Q>,  all characters whose
    code points are above 127 are not quoted in UTF-8 encoded strings, but
    all are quoted in UTF-8 strings.

That (still) makes no sense to me.  Here's the wording I came up with that
reflects what I *thought* it was trying to say:

    In C<quotemeta> or its inline equivalent C<\Q>, no characters whose
    code points are above 127 are quoted in UTF-8 encoded strings, but in
    byte encoded strings, code points between 128-255 are always quoted.

Except that that is not true. :(  I've played with blead, including a build
compiled afresh this morning, on both Darwin and Linux, and I still
can't figure out what is supposed to happen, because the behavior matches
neither of the paragraphs above.  Judging from Devel::Peek, I think
strings aren't being properly utf8'd.  According to what I think the
documentation should be saying, this should not be happening:

    % blead -CS -M-feature=unicode_strings -le '$a = "\x{e9}";  print quotemeta($a)'
    \é
    % blead -CS -Mfeature=unicode_strings  -le '$a = "\x{e9}";  print quotemeta($a)'
    \é

This happens on both Darwin and Linux, and I don't understand why, with -E or
unicode_strings, I have a non-Unicode string!

    % blead -CS -MDevel::Peek -E '$a = "\x{e9}";  say "\Q$a"'
    \é
    % blead -CS -MDevel::Peek -E '$a = "\x{e9}";  Dump "\Q$a"'
    SV = PV(0x8010d8) at 0x80ed20
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK)
      PV = 0x203dc0 "\\\351"\0
      CUR = 2
      LEN = 16

    % blead -CS -MDevel::Peek -Mfeature=unicode_strings -le '$a = "\x{e9}";  Dump($a)'
    SV = PV(0x801038) at 0x80ed60
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x201380 "\351"\0
      CUR = 1
      LEN = 16
    % blead -CS -MDevel::Peek -Mfeature=unicode_strings -le '$a = "\x{e9}";  Dump("\Q$a")'
    SV = PV(0x8010e8) at 0x80ed30
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK)
      PV = 0x203e00 "\\\351"\0
      CUR = 2
      LEN = 16

But look!

    % blead -CS -MDevel::Peek -E '$a = "\x{e9}";  utf8::upgrade($a) ; say "\Q$a"'
    é
    % blead -CS -MDevel::Peek -E '$a = "\x{e9}";  utf8::upgrade($a) ; Dump "\Q$a"'
    SV = PV(0x8010d8) at 0x80f040
      REFCNT = 1
      FLAGS = (PADTMP,POK,pPOK,UTF8)
      PV = 0x203df0 "\303\251"\0 [UTF8 "\x{e9}"]
      CUR = 2
      LEN = 16

I thought the whole point was so I didn't have to *do* that anymore. :(
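
A minimal sketch of the same comparison as a standalone script, in case the
one-liners above are awkward to rerun.  The helper names (codepoints,
quote_without_feature, quote_with_feature) are made up for illustration and
are not part of any module; utf8::is_utf8 stands in for reading the FLAGS
line of the Devel::Peek dump.  What it prints depends on the perl being
tested, so no expected output is shown:

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as a list of code points, so the
# result does not depend on the terminal or on -CS output layers.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# The feature is lexically scoped, so each quotemeta call has to be
# compiled inside the scope whose behavior we want to observe.
sub quote_without_feature {
    no feature 'unicode_strings';
    return quotemeta($_[0]);
}

sub quote_with_feature {
    use feature 'unicode_strings';
    return quotemeta($_[0]);
}

my $native   = "\x{e9}";      # stored as a single byte, UTF8 flag off
my $upgraded = "\x{e9}";
utf8::upgrade($upgraded);     # same character, UTF8 flag on

for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    printf "%-8s UTF8 flag=%d  without feature: %-14s with feature: %s\n",
        $label,
        utf8::is_utf8($str) ? 1 : 0,
        codepoints(quote_without_feature($str)),
        codepoints(quote_with_feature($str));
}
```

If the two rows differ, or the two columns within a row differ, then
quotemeta is still keying off the internal representation rather than the
string's contents, which is the behavior being questioned here.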

--tom
