develooper Front page | perl.perl5.porters | Postings from July 2000

[ID 20000730.004] strangeness with Unicode

Jeffrey Friedl
July 31, 2000 11:14

This is a bug report for perl from,
generated with the help of perlbug 1.28 running under perl v5.6.0.

[Please enter your report here]

This is another one where I hesitate to say it's a bug, since this is my
first venture into anything Unicode, but the action seems sufficiently
strange that I thought I'd post it.

Here's a test program that inspects the length of strings in a number
of ways:

    #!/usr/local/bin/perl -w
    use strict;
    { use bytes; } # just to make available later
    use utf8;

    my $smiley = "\x{263a}"; ## a smiley character

    my $count = 0;
    for my $string ("\x{263a}",                     #  1
                    $smiley,                        #  2

                    "" . $smiley,                   #  3
                    "" . "\x{263a}",                #  4

                    $smiley    . "",                #  5
                    "\x{263a}" . "",                #  6

                    "\x{263a}" . "\x{263a}",        #  7
                    $smiley    . $smiley,           #  8

                    "\x{263a}\x{263a}",             #  9
                    "$smiley$smiley",               # 10

                    "\x{263a}" x 2,                 # 11
                    $smiley    x 2,                 # 12
                   )
    {
        $count++;                             ## number the test strings from 1

        my $chars = length($string);          ## Unicode characters
        my $bytes = bytes::length($string);   ## raw bytes

        my @regexchars = $string =~ m/(.)/g;
        my $regexchars = @regexchars;         ## chars as per the regex engine

        my @splitchars = split //, $string;
        my $splitchars = @splitchars;         ## see how split counts them

        print "$count: string [$string] has chars=$chars/$regexchars/$splitchars, bytes=$bytes\n";
    }

Here's the output, piped through less (which shows hex codes for non-ASCII):

  1: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  2: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  3: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  4: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  5: string [<E2><98><BA>] has chars=3/1/1, bytes=3
  6: string [<E2><98><BA>] has chars=3/1/1, bytes=3
  7: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
  8: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
  9: string [<E2><98><BA><E2><98><BA>] has chars=2/2/2, bytes=6
 10: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
 11: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6
 12: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6

The first four look fine to me, since <E2><98><BA> is the UTF-8 encoding of the smiley:

    % utf8-decode
    Enter Unicode> <E2><98><BA>
    Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
      So (Symbol, Other)
      ON (Other Neutrals)

and indeed, when I view the output on a utf8 xterm, I see the smiley.
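
For reference, the three bytes can also be worked out by hand from the
code point. The sketch below assumes the standard three-byte UTF-8 layout
(1110xxxx 10xxxxxx 10xxxxxx) used for code points U+0800 through U+FFFF;
utf8_three_bytes is a name made up for this illustration, not a Perl builtin.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical helper: split a code point in U+0800..U+FFFF into the
# three bytes of its UTF-8 encoding (1110xxxx 10xxxxxx 10xxxxxx).
sub utf8_three_bytes {
    my ($cp) = @_;
    die "code point outside the 3-byte range\n"
        if $cp < 0x800 || $cp > 0xFFFF;
    return (0xE0 | ($cp >> 12),            # top 4 bits
            0x80 | (($cp >> 6) & 0x3F),    # middle 6 bits
            0x80 | ($cp & 0x3F));          # low 6 bits
}

# Format the bytes the same way less displays them.
my $encoded = join '', map { sprintf "<%02X>", $_ } utf8_three_bytes(0x263A);
print "$encoded\n";    # <E2><98><BA>
```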

Lines 5 and 6 seem odd, since the length() is 3 instead of the 1 I'd expect.

As for the rest, 7-12, I'd expect them all to be like #9, which shows
correctly that the two smileys are two characters.

#11 and #12 just have the length() wrong, but the other three (#7, #8, and
#10) are really wild. I'd expect 6 bytes to create the two characters, but as
it is, there are nine bytes creating four Unicode characters:

    % utf8-decode
    Enter Unicode> <C3><A2><C2><98><C2><BA><E2><98><BA>
    Unicode 00E2 encoded in utf8 as a 2-byte sequence: <C3> <A2>
      Ll (Letter, Lowercase)
      decomp=[0061 0302]
      has upper (00C2)
    Unicode 0098 encoded in utf8 as a 2-byte sequence: <C2> <98>
      Cc (Other, Control)
      BN (Boundary Neutral)
    Unicode 00BA encoded in utf8 as a 2-byte sequence: <C2> <BA>
      Ll (Letter, Lowercase)
      decomp=[<super> 006F]
    Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
      So (Symbol, Other)
      ON (Other Neutrals)

But at least the length() is consistent for them: four characters by every count.
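
One possible reading of those nine bytes, offered as a guess rather than a
diagnosis of what perl is doing internally: the first six are exactly what
comes out if the three bytes of one smiley are each treated as a character
of their own (U+00E2, U+0098, U+00BA) and encoded as UTF-8 a second time.
The sketch below does only that arithmetic, assuming the standard two-byte
layout (110xxxxx 10xxxxxx) for code points U+0080 through U+07FF; the
variable names are made up for illustration.

```perl
#!/usr/local/bin/perl -w
use strict;

my @smiley_bytes = (0xE2, 0x98, 0xBA);    # UTF-8 encoding of U+263A

# Treat each byte as a separate code point and encode it again.
# All three values fall in U+0080..U+07FF, so each becomes two bytes.
my @reencoded = map { (0xC0 | ($_ >> 6), 0x80 | ($_ & 0x3F)) } @smiley_bytes;

my $hex = join '', map { sprintf "<%02X>", $_ } @reencoded;
print "$hex\n";    # <C3><A2><C2><98><C2><BA>
```

Those six bytes followed by an untouched <E2><98><BA> give the nine bytes
seen in #7, #8, and #10.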

So, it seems that there are two separate problems:

   * length() not working correctly (examples 5, 6, 11, 12)
   * string concatenation not working (examples 7, 8, 10)
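
The expectations behind those two points can be written down as a small
self-checking script. This is only a sketch: the descriptions and the @cases
table are made up here for illustration, and it covers one representative
of each construct rather than all twelve strings. It assumes each smiley
should count as one character and occupy three bytes.

```perl
#!/usr/local/bin/perl -w
use strict;
{ use bytes; }    # load bytes.pm so bytes::length() is available
use utf8;

my $smiley = "\x{263a}";

# Each entry: [ description, string under test, expected character count ]
my @cases = (
    [ 'append empty string', $smiley . "",      1 ],
    [ 'concatenation',       $smiley . $smiley, 2 ],
    [ 'interpolation',       "$smiley$smiley",  2 ],
    [ 'repetition',          $smiley x 2,       2 ],
);

my $failures = 0;
for my $case (@cases) {
    my ($what, $string, $want_chars) = @$case;

    my $chars = length($string);
    my $bytes = bytes::length($string);

    # One smiley is three bytes of UTF-8, so bytes should be 3 * chars.
    my $ok = ($chars == $want_chars && $bytes == 3 * $want_chars);
    $failures++ unless $ok;

    printf "%-6s %s: chars=%d (want %d), bytes=%d (want %d)\n",
        ($ok ? "ok" : "not ok"), $what,
        $chars, $want_chars, $bytes, 3 * $want_chars;
}
print "$failures failure(s)\n";
```

Any "not ok" line points at one of the two problems: a wrong chars count
with the right bytes count is the length() problem, and a wrong bytes count
is the concatenation problem.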

But hey, I'm learning a lot about Unicode :-)

[Please do not change anything below this line]
Site configuration information for perl v5.6.0:

Configured by jfriedl at Sat Jul 29 20:09:33 PDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
    osname=linux, osvers=2.2.15, archname=i686-linux
    uname='linux 2.2.16 #6 smp sun jul 23 11:26:16 pdt 2000 i686 unknown '
    config_args='-ds -e -A optimize=-g'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define 
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
    cc='cc', optimize='-O2 -g', gccversion=pgcc-2.91.66 19990314 (egcs-1.1.2 release)
    cppflags='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.6.0:

Environment for perl v5.6.0:
    LANG (unset)
    LANGUAGE (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)
