From: Jeffrey Friedl
Date: July 31, 2000 11:14
Subject: uft8/chr()
Message ID: 200007310144.SAA14426@ventrue.yahoo.com
Hi,
I'm playing with utf8, trying to understand both Unicode and how Perl
deals with it, and have run into some inconsistencies in Perl (or in my
understanding).
In
#!/usr/local/bin/perl -w
use strict;
use utf8;
print "\x{00E2}" , "\n";
print chr(0x00E2), "\n";
I would have thought that the two prints would print the same thing (the
letter 'a' with ^ above), but chr(0x00E2) returns a single byte with
the value 0xe2, not the expected UTF-8 sequence 0xc3 0xa2 that "\x{00E2}"
returns.
It's not because I put the extra zeros (0x00E2 vs. 0xE2) that I thought it
should convert to utf8 for me, but because it was utf8 mode and I was
asking for the CHaRacter with value 0xE2.
I guess I can/should use pack('U', $value), but the chr() approach seemed
more natural to me.
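For comparison, here is a minimal sketch of the pack('U', ...) route. It assumes a later Perl release than the one discussed here (5.8 or newer, as far as I know), where utf8::encode() is built in. The variable names are made up for illustration. It builds the character from its code point, then dumps the UTF-8 octets of a copy so the 0xc3 0xa2 sequence is visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the character U+00E2 from its code point.
my $char = pack('U', 0xE2);

# Copy it and turn the copy into its UTF-8 octets, in place.
# (Assumption: utf8::encode() is available, i.e. a 5.8+ Perl.)
my $octets = $char;
utf8::encode($octets);

# Hex-dump the octets so the encoding is visible.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;

# On such a Perl, chr() with the same code point compares equal.
my $same_as_chr = (chr(0xE2) eq $char) ? 1 : 0;

print "characters: ", length($char), "\n";
print "octets:     $hex\n";
print "chr() gives the same character: $same_as_chr\n";
```

On a Perl like that, the difference between the two calls is mostly in how the string is stored internally, not in which character it holds. What gets written by print then depends on the output handle's encoding.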
Here's a semi-related question. If I have a string of bytes that I know
to be valid UTF-8, how can I get Perl to consider them as such? I'd hoped
that I could just stuff them into a string in a 'use bytes' block, but
I get mixed results:
Here's a small script:
#!/usr/local/bin/perl -w
use strict;
## stuff raw UTF-8 bytes into string.
my $string = do {
use bytes;
"\xC3\xA2"; ## 'a' with ^ above
};
use utf8;
if (m/^(?:\p{IsLu})*$/) {
print "is a lowercase letter\n";
}
## count length via length()
my $length = length($string);
## count length via regex
my @chars = $string =~ m/./g;
my $count = @chars;
print "length=$length, regex=$count: [$string]\n";
When I run it, I get:
Use of uninitialized value in pattern match (m//) at test line 12.
is a lowercase letter
length=2, regex=1: [{C3}{A2}]
The regex=1 shows that the regex engine did consider it to be a single
character, but length() still thought that it was two. (I've run into
problems with length() in other situations, and included it among the
several perlbugs I submitted this weekend.)
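One way to make length() and the regex agree is to decode the bytes explicitly instead of relying on 'use bytes' and 'use utf8' scoping. The sketch below assumes the Encode module, which ships with later Perls (5.8 and newer) and is not something the Perl in this message provides. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for 'a' with ^ above (U+00E2).
my $octets = "\xC3\xA2";

# Tell Perl explicitly that these bytes are UTF-8 text.
my $text = decode('UTF-8', $octets);

my $octet_len   = length($octets);        # counts bytes
my $char_len    = length($text);          # counts characters
my $regex_count = () = $text =~ m/./g;    # counts matches of '.'

print "octets=$octet_len, length=$char_len, regex=$regex_count\n";
```

After the decode, both counts are taken over the same one-character string, so the mismatch seen above should not arise.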
The check to see if it's a \p{IsLu} succeeded, which is good, but there's
that uninitialized value warning, so it could be coincidental that it
passed. The re 'debug' output has a lot of stuff in it that must be from
other packages, but perhaps it's helpful in seeing where the uninitialized
value is coming from:
% ./test |& perl -pe 's/[\x80-\xff]/sprintf "{%02X}", ord($&)/ge'
Compiling REx `^(?:\p{IsLu})*$'
size 6 Compiling REx `::'
size 3 first at 1
1: EXACT <::>(3)
3: END(0)
anchored `::' at 0 (checking anchored isall) minlen 2
Compiling REx `^(I[sn]|To)([A-Z].*)'
size 36 first at 2
1: BOL(2)
2: OPEN1(4)
4: BRANCH(16)
5: EXACT <I>(7)
7: ANYOF[ns](19)
16: BRANCH(19)
17: EXACT <To>(19)
19: CLOSE1(21)
21: OPEN2(23)
23: ANYOF[A-Z](32)
32: STAR(34)
33: REG_ANY(0)
34: CLOSE2(36)
36: END(0)
anchored(BOL) minlen 3
Compiling REx `^'
size 2 first at 2
1: MBOL(2)
2: END(0)
stclass `END' anchored(MBOL) minlen 0
Compiling REx `^&'
size 4 first at 2
1: BOL(2)
2: EXACT <&>(4)
4: END(0)
anchored `&' at 0 (checking anchored) anchored(BOL) minlen 1
Compiling REx `\W'
size 2 first at 1
1: NALNUM(2)
2: END(0)
stclass `NALNUM' minlen 1
Matching REx `\W' against `confess'
Matching REx `\W' against `croak'
Matching REx `\W' against `carp'
Compiling REx `^[^0-9a-fA-F]'
size 11 first at 2
1: BOL(2)
2: ANYOF[\0-/:-@G-`g-\377](11)
11: END(0)
stclass `ANYOF[\0-/:-@G-`g-\377]' anchored(BOL) minlen 1
Compiling REx `^([0-9a-fA-F]+)'
size 16 first at 2
synthetic stclass `ANYOF[0-9A-Fa-f]'.
1: BOL(2)
2: OPEN1(4)
4: PLUS(14)
5: ANYOF[0-9A-Fa-f](0)
14: CLOSE1(16)
16: END(0)
stclass `ANYOF[0-9A-Fa-f]' anchored(BOL) minlen 1
Compiling REx `\tXXXX$'
size 5 first at 1
1: EXACT < XXXX>(4)
4: MEOL(5)
5: END(0)
anchored ` XXXX'$ at 0 (checking anchored isall) minlen 5
Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+)?)(?:\t([0-9a-fA-F]+))?'
size 56 first at 2
synthetic stclass `ANYOF[0-9A-Fa-f]'.
1: MBOL(2)
2: OPEN1(4)
4: PLUS(14)
5: ANYOF[0-9A-Fa-f](0)
14: CLOSE1(16)
16: EXACT < >(18)
18: CURLYX {0,1}(35)
20: OPEN2(22)
22: PLUS(32)
23: ANYOF[0-9A-Fa-f](0)
32: CLOSE2(34)
34: WHILEM[1/2](0)
35: NOTHING(36)
36: CURLYX {0,1}(55)
38: EXACT < >(40)
40: OPEN3(42)
42: PLUS(52)
43: ANYOF[0-9A-Fa-f](0)
52: CLOSE3(54)
54: WHILEM[2/2](0)
55: NOTHING(56)
56: END(0)
floating ` ' at 1..2147483647 (checking floating) stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 2
Compiling REx `^([^0-9a-fA-F\n])(.*)'
size 21 first at 2
synthetic stclass `ANYOF[\0-\11\13-/:-@G-`g-\377]'.
1: MBOL(2)
2: OPEN1(4)
4: ANYOF[\0-\11\13-/:-@G-`g-\377](13)
13: CLOSE1(15)
15: OPEN2(17)
17: STAR(19)
18: REG_ANY(0)
19: CLOSE2(21)
21: END(0)
stclass `ANYOF[\0-\11\13-/:-@G-`g-\377]' anchored(MBOL) minlen 1
Compiling REx `[-+!]'
size 10 first at 1
1: ANYOF[!+\-](10)
10: END(0)
stclass `ANYOF[!+\-]' minlen 1
Compiling REx `::'
size 3 first at 1
1: EXACT <::>(3)
3: END(0)
anchored `::' at 0 (checking anchored isall) minlen 2
Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+)?)(?:\t([0-9a-fA-F]+))?'
size 56 first at 2
synthetic stclass `ANYOF[0-9A-Fa-f]'.
1: MBOL(2)
2: OPEN1(4)
4: PLUS(14)
5: ANYOF[0-9A-Fa-f](0)
14: CLOSE1(16)
16: EXACT < >(18)
18: CURLYX {0,1}(35)
20: OPEN2(22)
22: PLUS(32)
23: ANYOF[0-9A-Fa-f](0)
32: CLOSE2(34)
34: WHILEM[1/2](0)
35: NOTHING(36)
36: CURLYX {0,1}(55)
38: EXACT < >(40)
40: OPEN3(42)
42: PLUS(52)
43: ANYOF[0-9A-Fa-f](0)
52: CLOSE3(54)
54: WHILEM[2/2](0)
55: NOTHING(56)
56: END(0)
floating ` ' at 1..2147483647 (checking floating) stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 2
Compiling REx `^([0-9a-fA-F]+)(?:\t([0-9a-fA-F]+))?'
size 36 first at 2
synthetic stclass `ANYOF[0-9A-Fa-f]'.
1: MBOL(2)
2: OPEN1(4)
4: PLUS(14)
5: ANYOF[0-9A-Fa-f](0)
14: CLOSE1(16)
16: CURLYX {0,1}(35)
18: EXACT < >(20)
20: OPEN2(22)
22: PLUS(32)
23: ANYOF[0-9A-Fa-f](0)
32: CLOSE2(34)
34: WHILEM[1/1](0)
35: NOTHING(36)
36: END(0)
stclass `ANYOF[0-9A-Fa-f]' anchored(MBOL) minlen 1
Compiling REx `^([-+!])(.*)'
size 21 first at 2
synthetic stclass `ANYOF[!+\-]'.
1: MBOL(2)
2: OPEN1(4)
4: ANYOF[!+\-](13)
13: CLOSE1(15)
15: OPEN2(17)
17: STAR(19)
18: REG_ANY(0)
19: CLOSE2(21)
21: END(0)
stclass `ANYOF[!+\-]' anchored(MBOL) minlen 1
first at 2
1: BOL(2)
2: STAR(5)
3: ANYOFUTF8{i}[^!-$&')*0A-DFG`cefhj{85}{86}{A0}{A3}{A4}{A5}{C3}{C6}{C9}{CC}{D1}{D2}{D3}{D5}-{D9}{DB}{DC}{DD}{DF}{E2}{E5}\w\W\s\S\d[:alnum:][:ascii:][:^ascii:][:ctrl:][:^ctrl:][:lower:][:^lower:][:print:][:^print:][:^punct:][:xdigit:]](0)
5: EOL(6)
6: END(0)
floating `'$ at 0..2147483647 (checking floating) anchored(BOL) minlen 0
Compiling REx `.'
size 2 first at 1
1: ANYUTF8(2)
2: END(0)
minlen 1
Use of uninitialized value in pattern match (m//) at utf8-4 line 12.
Guessing start of match, REx `^(?:\p{IsLu})*$' against `'...
Found floating substr `'$ at offset 0...
Guessed: match at offset 0
Matching REx `^(?:\p{IsLu})*$' against `'
Setting an EVAL scope, savestack=5
0 <> <> | 1: BOL
0 <> <> | 2: STAR
ANYOFUTF8{i}[^!-$&')*0A-DFG`cefhj{85}{86}{A0}{A3}{A4}{A5}{C3}{C6}{C9}{CC}{D1}{D2}{D3}{D5}-{D9}{DB}{DC}{DD}{DF}{E2}{E5}\w\W\s\S\d[:alnum:][:ascii:][:^ascii:][:ctrl:][:^ctrl:][:lower:][:^lower:][:print:][:^print:][:^punct:][:xdigit:]] can match 0 times out of 32767...
Setting an EVAL scope, savestack=5
0 <> <> | 5: EOL
0 <> <> | 6: END
Match successful!
is a lowercase letter
Matching REx `.' against `{C3}{A2}'
Setting an EVAL scope, savestack=7
0 <> <{C3}{A2}> | 1: ANYUTF8
2 <{C3}{A2}> <> | 2: END
Match successful!
length=2, regex=1: [{C3}{A2}]
Freeing REx: `^(?:\p{IsLu})*$'
Freeing REx: `.'
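One possible reading of the trace above: the \p{IsLu} pattern is matched against `' (an empty string), not against the two bytes. The m// in the script has no =~ binding, so it tests $_, which is undefined. That would explain the warning, and a starred group always matches an empty string. Note also that Lu is the uppercase-letter property and Ll the lowercase one. The sketch below binds the match to the string and uses a + quantifier. It assumes a later Perl with Encode available, and its variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# 'a' with ^ above, decoded from its UTF-8 bytes into one character.
my $string = decode('UTF-8', "\xC3\xA2");

# Bind each match to $string explicitly with =~, and require at
# least one character so an empty string cannot pass.
my $is_upper = ($string =~ m/^\p{Lu}+$/) ? 1 : 0;
my $is_lower = ($string =~ m/^\p{Ll}+$/) ? 1 : 0;

# The original pattern shape, tried against an empty string.
my $empty_matches = ('' =~ m/^(?:\p{Lu})*$/) ? 1 : 0;

print "upper=$is_upper lower=$is_lower empty=$empty_matches\n";
```

If that reading is right, the "is a lowercase letter" line in the output says nothing about the contents of $string.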
Ideas?
Jeffrey
------------------------------------------------------------------------------
Jeffrey Friedl <jfriedl@yahoo-inc.com> Yahoo! Finance http://finance.yahoo.com