perl.perl5.porters | Postings from July 2000

uft8/chr()

From: Jeffrey Friedl
Date: July 31, 2000 11:14
Subject: uft8/chr()
Message ID: 200007310144.SAA14426@ventrue.yahoo.com

Hi,
I'm playing with utf8, trying to understand both Unicode and how Perl
deals with it, and have run into some inconsistencies in Perl (or in my
understanding).

In
    #!/usr/local/bin/perl -w
    use strict;
    use utf8;
    print "\x{00E2}" , "\n";
    print chr(0x00E2), "\n";

I would have thought that the two prints would print the same thing (the
letter 'a' with ^ above), but chr(0x00E2) returns a single byte with
the value 0xe2, not the expected UTF-8 0xc3 0xa2 sequence that "\x{00E2}"
returns.

It's not because I put the extra zeros (0x00E2 vs. 0xE2) that I thought it
should convert to utf8 for me, but because it was utf8 mode and I was
asking for the CHaRacter with value 0xE2.

I guess I can/should use pack('U', $value), but the chr() approach seemed
more natural to me.
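
For comparison, here is a minimal sketch of the chr() versus pack('U')
question. It assumes a later Perl release that ships the Encode module,
not the Perl described in this message, and the variable names are made
up for illustration. On such a Perl, both calls should yield the same
one-character string, and the UTF-8 byte sequence only appears once the
string is explicitly encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);    # assumption: Encode is available (later Perls)

my $value = 0xE2;

# Two ways of asking for the character with code point 0xE2.
my $via_chr  = chr($value);
my $via_pack = pack('U', $value);

# Compared as character strings, the two results are equal.
my $same = ($via_chr eq $via_pack) ? 1 : 0;

# Explicitly encode the character to its UTF-8 byte sequence.
my $bytes = encode('UTF-8', $via_chr);
my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

print "same=$same hex=$hex\n";
```

Keeping the encoding step explicit makes it clear whether a given string
holds characters or already-encoded bytes.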




Here's a semi-related question. If I have a string of bytes that I know
to be valid UTF-8, how can I get Perl to consider them as such? I'd hoped
that I could just stuff them into a string in a 'use bytes' block, but
I get mixed results:

Here's a small script:
    #!/usr/local/bin/perl -w
    use strict;

    ## stuff raw UTF-8 bytes into string.
    my $string = do {
       use bytes;
       "\xC3\xA2"; ## 'a' with ^ above
    };

    use utf8;

    if (m/^(?:\p{IsLu})*$/) {
       print "is a lowercase letter\n";
    }

    ## count length via length()
    my $length = length($string);

    ## count length via regex
    my @chars = $string =~ m/./g;
    my $count = @chars;

    print "length=$length, regex=$count: [$string]\n";


When I run it, I get:

    Use of uninitialized value in pattern match (m//) at test line 12.
    is a lowercase letter
    length=2, regex=1: [{C3}{A2}]

The regex=1 shows that the regex engine did consider it to be a single
character, but length() still thought that it was two. (I've run into
problems with length() in other situations, and included it among the
several perlbugs I submitted this weekend.)
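
As a point of comparison for the length() versus regex mismatch, here is
a hedged sketch of how raw UTF-8 bytes can be turned into a character
string. It assumes a later Perl release with the Encode module, which is
not what this message was run on; $raw and $text are illustrative names.
After an explicit decode, length() and the regex count should agree.

```perl
use strict;
use warnings;
use Encode qw(decode);    # assumption: Encode is available (later Perls)

## raw UTF-8 bytes for 'a' with ^ above
my $raw = "\xC3\xA2";

## decode the bytes into a character string
my $text = decode('UTF-8', $raw);

## count length via length(), before and after decoding
my $raw_length  = length($raw);
my $text_length = length($text);

## count length via regex on the decoded string
my @chars = $text =~ m/./g;
my $count = @chars;

print "raw=$raw_length, decoded=$text_length, regex=$count\n";
```

The undecoded string still counts as two bytes; only the decoded copy is
treated as a single character by both length() and the regex.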

The check to see if it's a \p{IsLu} succeeded, which is good, but there's
that uninitialized value warning, so it could be coincidental that it
passed. The re 'debug' output has a lot of stuff in it that must be from
other packages, but perhaps it's helpful in seeing where the uninitialized
value is coming from:

    % ./test |& perl -pe 's/[\x80-\xff]/sprintf "{%02X}", ord($&)/ge'
    Compiling REx `^(?:\p{IsLu})*$'
    size 6 Compiling REx `::'
    size 3 first at 1
       1: EXACT <::>(3)
       3: END(0)
    anchored `::' at 0 (checking anchored isall) minlen 2 
    Compiling REx `^(I[sn]|To)([A-Z].*)'
    size 36 first at 2
       1: BOL(2)
       2: OPEN1(4)
       4:   BRANCH(16)
       5:     EXACT <I>(7)
       7:     ANYOF[ns](19)
      16:   BRANCH(19)
      17:     EXACT <To>(19)
      19: CLOSE1(21)
      21: OPEN2(23)
      23:   ANYOF[A-Z](32)
      32:   STAR(34)
      33:     REG_ANY(0)
      34: CLOSE2(36)
      36: END(0)
    anchored(BOL) minlen 3 
    Compiling REx `^'
    size 2 first at 2
       1: MBOL(2)
       2: END(0)
    stclass `END' anchored(MBOL) minlen 0 
    Compiling REx `^&'
    size 4 first at 2
       1: BOL(2)
       2: EXACT <&>(4)
       4: END(0)
    anchored `&' at 0 (checking anchored) anchored(BOL) minlen 1 
    Compiling REx `\W'
    size 2 first at 1
       1: NALNUM(2)
       2: END(0)
    stclass `NALNUM' minlen 1 
    Matching REx `\W' against `confess'
    Matching REx `\W' against `croak'
    Matching REx `\W' against `carp'
    Compiling REx `^[^0-9a-fA-F]'
    size 11 first at 2
       1: BOL(2)
       2: ANYOF[\0-/:-@G-`g-\377](11)
      11: END(0)
    stclass `ANYOF[\0-/:-@G-`g-\377]' anchored(BOL) minlen 1 
    Compiling REx `^([0-9a-fA-F]+)'
    size 16 first at 2
    synthetic stclass `ANYOF[0-9A-Fa-f]'.
       1: BOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF[0-9A-Fa-f](0)
      14: CLOSE1(16)
      16: END(0)
    stclass `ANYOF[0-9A-Fa-f]' anchored(BOL) minlen 1 
    Compiling REx `\tXXXX$'
    size 5 first at 1
       1: EXACT <	XXXX>(4)
       4: MEOL(5)
       5: END(0)
    anchored `	XXXX'$ at 0 (checking anchored isall) minlen 5 
    Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+)?)(?:\t([0-9a-fA-F]+))?'
    size 56 first at 2
    synthetic stclass `ANYOF[0-9A-Fa-f]'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF[0-9A-Fa-f](0)
      14: CLOSE1(16)
      16: EXACT <	>(18)
      18: CURLYX {0,1}(35)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF[0-9A-Fa-f](0)
      32:   CLOSE2(34)
      34:   WHILEM[1/2](0)
      35: NOTHING(36)
      36: CURLYX {0,1}(55)
      38:   EXACT <	>(40)
      40:   OPEN3(42)
      42:     PLUS(52)
      43:       ANYOF[0-9A-Fa-f](0)
      52:   CLOSE3(54)
      54:   WHILEM[2/2](0)
      55: NOTHING(56)
      56: END(0)
    floating `	' at 1..2147483647 (checking floating) stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 2 
    Compiling REx `^([^0-9a-fA-F\n])(.*)'
    size 21 first at 2
    synthetic stclass `ANYOF[\0-\11\13-/:-@G-`g-\377]'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   ANYOF[\0-\11\13-/:-@G-`g-\377](13)
      13: CLOSE1(15)
      15: OPEN2(17)
      17:   STAR(19)
      18:     REG_ANY(0)
      19: CLOSE2(21)
      21: END(0)
    stclass `ANYOF[\0-\11\13-/:-@G-`g-\377]' anchored(MBOL) minlen 1 
    Compiling REx `[-+!]'
    size 10 first at 1
       1: ANYOF[!+\-](10)
      10: END(0)
    stclass `ANYOF[!+\-]' minlen 1 
    Compiling REx `::'
    size 3 first at 1
       1: EXACT <::>(3)
       3: END(0)
    anchored `::' at 0 (checking anchored isall) minlen 2 
    Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+)?)(?:\t([0-9a-fA-F]+))?'
    size 56 first at 2
    synthetic stclass `ANYOF[0-9A-Fa-f]'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF[0-9A-Fa-f](0)
      14: CLOSE1(16)
      16: EXACT <	>(18)
      18: CURLYX {0,1}(35)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF[0-9A-Fa-f](0)
      32:   CLOSE2(34)
      34:   WHILEM[1/2](0)
      35: NOTHING(36)
      36: CURLYX {0,1}(55)
      38:   EXACT <	>(40)
      40:   OPEN3(42)
      42:     PLUS(52)
      43:       ANYOF[0-9A-Fa-f](0)
      52:   CLOSE3(54)
      54:   WHILEM[2/2](0)
      55: NOTHING(56)
      56: END(0)
    floating `	' at 1..2147483647 (checking floating) stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 2 
    Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?'
    size 36 first at 2
    synthetic stclass `ANYOF[0-9A-Fa-f]'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF[0-9A-Fa-f](0)
      14: CLOSE1(16)
      16: CURLYX {0,1}(35)
      18:   EXACT <	>(20)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF[0-9A-Fa-f](0)
      32:   CLOSE2(34)
      34:   WHILEM[1/1](0)
      35: NOTHING(36)
      36: END(0)
    stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 1 
    Compiling REx `^([-+!])(.*)'
    size 21 first at 2
    synthetic stclass `ANYOF[!+\-]'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   ANYOF[!+\-](13)
      13: CLOSE1(15)
      15: OPEN2(17)
      17:   STAR(19)
      18:     REG_ANY(0)
      19: CLOSE2(21)
      21: END(0)
    stclass `ANYOF[!+\-]' anchored(MBOL) minlen 1 
    first at 2
       1: BOL(2)
       2: STAR(5)
       3:   ANYOFUTF8{i}[^!-$&')*0A-DFG`cefhj{85}{86}{A0}{A3}{A4}{A5}{C3}{C6}{C9}{CC}{D1}{D2}{D3}{D5}-{D9}{DB}{DC}{DD}{DF}{E2}{E5}\w\W\s\S\d[:alnum:][:ascii:][:^ascii:][:ctrl:][:^ctrl:][:lower:][:^lower:][:print:][:^print:][:^punct:][:xdigit:]](0)
       5: EOL(6)
       6: END(0)
    floating `'$ at 0..2147483647 (checking floating) anchored(BOL) minlen 0 
    Compiling REx `.'
    size 2 first at 1
       1: ANYUTF8(2)
       2: END(0)
    minlen 1 
    Use of uninitialized value in pattern match (m//) at utf8-4 line 12.
    Guessing start of match, REx `^(?:\p{IsLu})*$' against `'...
    Found floating substr `'$ at offset 0...
    Guessed: match at offset 0
    Matching REx `^(?:\p{IsLu})*$' against `'
      Setting an EVAL scope, savestack=5
       0 <> <>                |  1:  BOL
       0 <> <>                |  2:  STAR
			       ANYOFUTF8{i}[^!-$&')*0A-DFG`cefhj{85}{86}{A0}{A3}{A4}{A5}{C3}{C6}{C9}{CC}{D1}{D2}{D3}{D5}-{D9}{DB}{DC}{DD}{DF}{E2}{E5}\w\W\s\S\d[:alnum:][:ascii:][:^ascii:][:ctrl:][:^ctrl:][:lower:][:^lower:][:print:][:^print:][:^punct:][:xdigit:]] can match 0 times out of 32767...
      Setting an EVAL scope, savestack=5
       0 <> <>                |  5:    EOL
       0 <> <>                |  6:    END
    Match successful!
    is a lowercase letter
    Matching REx `.' against `{C3}{A2}'
      Setting an EVAL scope, savestack=7
       0 <> <{C3}{A2}>              |  1:  ANYUTF8
       2 <{C3}{A2}> <>              |  2:  END
    Match successful!
    length=2, regex=1: [{C3}{A2}]
    Freeing REx: `^(?:\p{IsLu})*$'
    Freeing REx: `.'
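
One possible reading of the trace above: the \p{IsLu} pattern is matched
against an empty string, which is what would happen if the match ran
against $_ rather than $string, and a pattern ending in )*$ matches an
empty string trivially. Note also that Lu is the uppercase-letter
property and Ll the lowercase one. The sketch below binds the match to
the string explicitly and tests both properties. It assumes a later Perl
with the Encode module and uses the short property names \p{Lu} and
\p{Ll}; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);    # assumption: Encode is available (later Perls)

## 'a' with ^ above, decoded from its raw UTF-8 bytes
my $string = decode('UTF-8', "\xC3\xA2");

## bind the match to $string explicitly, and require at least one
## character so that an empty string cannot match
my $is_upper = ($string =~ m/^\p{Lu}+$/) ? 1 : 0;
my $is_lower = ($string =~ m/^\p{Ll}+$/) ? 1 : 0;

print "upper=$is_upper lower=$is_lower\n";
```

Using + instead of * keeps an empty or undefined input from passing the
check by accident.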


Ideas?
	Jeffrey
------------------------------------------------------------------------------
Jeffrey Friedl <jfriedl@yahoo-inc.com> Yahoo! Finance http://finance.yahoo.com
