[ID 20000730.004] strangeness with Unicode
From: Jeffrey Friedl
Date: July 31, 2000 11:14
Subject: [ID 20000730.004] strangeness with Unicode
Message ID: 200007302303.QAA10908@ventrue.yahoo.com
This is a bug report for perl from jfriedl@yahoo-inc.com,
generated with the help of perlbug 1.28 running under perl v5.6.0.
-----------------------------------------------------------------
[Please enter your report here]
This is another one where I hesitate to say it's a bug, since this is my
first venture into anything Unicode, but the action seems sufficiently
strange that I thought I'd post it.
Here's a test program that inspects the length of strings in a number
of ways:
    #!/usr/local/bin/perl -w
    use strict;
    { use bytes; } # just to make available later
    use utf8;

    my $smiley = "\x{263a}"; ## a smiley character
    my $count = 0;
    for my $string ("\x{263a}",               #  1
                    $smiley,                  #  2
                    "" . $smiley,             #  3
                    "" . "\x{263a}",          #  4
                    $smiley . "",             #  5
                    "\x{263a}" . "",          #  6
                    "\x{263a}" . "\x{263a}",  #  7
                    $smiley . $smiley,        #  8
                    "\x{263a}\x{263a}",       #  9
                    "$smiley$smiley",         # 10
                    "\x{263a}" x 2,           # 11
                    $smiley x 2,              # 12
                   )
    {
        $count++;
        my $chars = length($string);         ## Unicode characters
        my $bytes = bytes::length($string);  ## raw bytes
        my @regexchars = $string =~ m/(.)/g;
        my $regexchars = @regexchars;        ## chars as per the regex engine
        my @splitchars = split //, $string;
        my $splitchars = @splitchars;        ## see how split counts them
        print "$count: string [$string] has chars=$chars/$regexchars/$splitchars, bytes=$bytes\n";
    }
Here's the output, piped through less (which shows hex codes for non-ASCII):
1: string [<E2><98><BA>] has chars=1/1/1, bytes=3
2: string [<E2><98><BA>] has chars=1/1/1, bytes=3
3: string [<E2><98><BA>] has chars=1/1/1, bytes=3
4: string [<E2><98><BA>] has chars=1/1/1, bytes=3
5: string [<E2><98><BA>] has chars=3/1/1, bytes=3
6: string [<E2><98><BA>] has chars=3/1/1, bytes=3
7: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
8: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
9: string [<E2><98><BA><E2><98><BA>] has chars=2/2/2, bytes=6
10: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
11: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6
12: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6
The first four look fine to me, as <E2><98><BA> is the UTF-8 encoding of the smiley:
% utf8-decode
Enter Unicode> <E2><98><BA>
Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
WHITE SMILING FACE
So (Symbol, Other)
ON (Other Neutrals)
and indeed, when I view the output on a utf8 xterm, I see the smiley.
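As a cross-check that does not rely on perl's own Unicode support, here is a minimal sketch that builds the UTF-8 bytes for U+263A with plain integer arithmetic. The helper names utf8_bytes and hex_dump are made up for this illustration, and the encoder only covers code points up to U+FFFF:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: encode one code point (U+0000..U+FFFF only) as a
# list of UTF-8 byte values, using nothing but shifts and masks.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 1 byte:  0xxxxxxx
    return (0xC0 | ($cp >> 6),                        # 2 bytes: 110xxxxx 10xxxxxx
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),                       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Hypothetical helper: show byte values the way less does, e.g. <E2>.
sub hex_dump { join '', map { sprintf '<%02X>', $_ } @_ }

print hex_dump(utf8_bytes(0x263A)), "\n";   # prints <E2><98><BA>
```

This agrees with the three bytes seen in outputs 1 through 4.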
Lines 5 and 6 seem odd, since the length() is 3 instead of the 1 I'd expect.
As for the rest, 7-12, I'd expect them all to be like #9, which shows
correctly that the two smileys are two characters.
#11 and #12 only have the length() wrong, but the other three (7, 8, and 10)
are really wild. I'd expect six bytes forming two characters, but instead
there are nine bytes forming four Unicode characters:
% utf8-decode
Enter Unicode> <C3><A2><C2><98><C2><BA><E2><98><BA>
Unicode 00E2 encoded in utf8 as a 2-byte sequence: <C3> <A2>
LATIN SMALL LETTER A WITH CIRCUMFLEX
Ll (Letter, Lowercase)
decomp=[0061 0302]
has upper (00C2)
Unicode 0098 encoded in utf8 as a 2-byte sequence: <C2> <98>
<control>
Cc (Other, Control)
BN (Boundary Neutral)
Unicode 00BA encoded in utf8 as a 2-byte sequence: <C2> <BA>
MASCULINE ORDINAL INDICATOR
Ll (Letter, Lowercase)
decomp=[<super> 006F]
Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
WHITE SMILING FACE
So (Symbol, Other)
ON (Other Neutrals)
But at least the length() is correct for them.
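The first six of those nine bytes are exactly what results when the smiley's three UTF-8 bytes are each treated as a Latin-1 character and encoded to UTF-8 a second time. The sketch below reproduces that byte pattern. It assumes the Encode module, which is not part of perl 5.6.0, so it only illustrates the pattern on a later perl; it says nothing certain about what 5.6.0 does internally. The helper hex_dump is a made-up name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);   # assumption: a perl that ships Encode (5.8 or later)

# Hypothetical helper: show each byte of a byte string as <XX>.
sub hex_dump { join '', map { sprintf '<%02X>', ord($_) } split //, $_[0] }

my $smiley_bytes = "\xE2\x98\xBA";                        # UTF-8 bytes of U+263A

# Misread the three bytes as three Latin-1 characters
# (U+00E2, U+0098, U+00BA) ...
my $as_latin1 = decode('ISO-8859-1', $smiley_bytes);

# ... and encode those three characters as UTF-8 again.
my $double = encode('UTF-8', $as_latin1);

print length($as_latin1), "\n";                    # prints 3
print hex_dump($double . $smiley_bytes), "\n";     # prints <C3><A2><C2><98><C2><BA><E2><98><BA>
```

The output matches the nine bytes shown for examples 7, 8, and 10, which would be consistent with the left-hand operand being re-encoded during concatenation, though that is only a guess at the cause.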
So, it seems that there are two separate problems:
* length() not working correctly (examples 5, 6, 11, 12)
* string concatenation not working (examples 7, 8, 10)
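The two suspected problems could be turned into an automated check along the following lines. This is only a sketch: the expectation that each smiley counts as one character and three bytes is taken from example 9 above, and the @cases and @failed names are invented for the illustration. It loads the bytes module without enabling byte semantics so that bytes::length is available:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use bytes ();    # load bytes::length without switching on byte semantics

my $smiley = "\x{263a}";

# [ example number, string, expected character count ]
my @cases = (
    [  5, $smiley . "",             1 ],
    [  6, "\x{263a}" . "",          1 ],
    [  7, "\x{263a}" . "\x{263a}",  2 ],
    [  8, $smiley . $smiley,        2 ],
    [ 10, "$smiley$smiley",         2 ],
    [ 11, "\x{263a}" x 2,           2 ],
    [ 12, $smiley x 2,              2 ],
);

my @failed;
for my $case (@cases) {
    my ($n, $string, $want) = @$case;
    # Each smiley should be one character and three UTF-8 bytes.
    my $ok = length($string) == $want
          && bytes::length($string) == 3 * $want;
    push @failed, $n unless $ok;
    printf "%2d: %s\n", $n, $ok ? 'ok' : 'NOT ok';
}
print @failed ? "failed: @failed\n" : "all consistent\n";
```

A failure on examples 5, 6, 11, or 12 would point at length(), and a failure on 7, 8, or 10 at concatenation.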
But hey, I'm learning a lot about Unicode :-)
Jeffrey
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=medium
---
Site configuration information for perl v5.6.0:
Configured by jfriedl at Sat Jul 29 20:09:33 PDT 2000.
Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
Platform:
osname=linux, osvers=2.2.15, archname=i686-linux
uname='linux fummy.dsl.yahoo.com 2.2.16 #6 smp sun jul 23 11:26:16 pdt 2000 i686 unknown '
config_args='-ds -e -A optimize=-g'
hint=previous, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=define
use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
Compiler:
cc='cc', optimize='-O2 -g', gccversion=pgcc-2.91.66 19990314 (egcs-1.1.2 release)
cppflags='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
stdchar='char', d_stdstdio=define, usevfork=false
intsize=4, longsize=4, ptrsize=4, doublesize=8
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, usemymalloc=n, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'
Locally applied patches:
---
@INC for perl v5.6.0:
/home/jfriedl/lib/perl
/home/jfriedl/lib/perl/yahoo
/usr/local/lib/perl5/5.6.0/i686-linux
/usr/local/lib/perl5/5.6.0
/usr/local/lib/perl5/site_perl/5.6.0/i686-linux
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl
.
---
Environment for perl v5.6.0:
HOME=/home/jfriedl
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH=/usr/local/pgsql/lib:/home/jfriedl/src/rvplayer5.0
LOGDIR (unset)
PATH=/home/jfriedl/bin:/home/jfriedl/common/bin:/usr/local/gcc-2.95.2/bin:.:/usr/local/pgsql/bin:/usr/local/bin:/usr/X11R6/bin:/bin:/usr/bin:/usr/sbin:/sbin:/home/jfriedl/src/rvplayer5.0
PERLLIB=/home/jfriedl/lib/perl:/home/jfriedl/lib/perl/yahoo
PERL_BADLANG (unset)
SHELL=/bin/tcsh