
Re: [perl #24541] substr and utf8 and use bytes

From: SADAHIRO Tomoyuki
Date: November 22, 2003 01:42
Subject: Re: [perl #24541] substr and utf8 and use bytes
Message ID: 20031122183202.8C94.BQW10602@nifty.com

On 22 Nov 2003 02:43:38 -0000
William R Ward (via RT) <perlbug-followup@perl.org> wrote:

> We have a need to take a string containing utf8-encoded multibyte
> characters, and then, treating the string as bytes, extract a
> substring of N characters from it.
> 
> This is what "use bytes" was meant for, and it works great on Perl
> 5.6.1.  But in Perl 5.8.1 it corrupts the multi-byte characters in the
> process of extracting them.  The following test script illustrates the
> issue well.  Run it under 5.6.1 and 5.8.1 and check the difference.

Thank you for your report.
However, the line
> eval ($] > 5.008) && binmode(FILE, ":utf8");
makes the comparison between 5.6.1 and 5.8.1 unfair,
since this line is executed only under 5.8.1.

Another snag is string concatenation in double quotes.
(cf. "after use bytes: str=$str, substr=$str2\n\n")

In 5.8.x, string concatenation is not affected by bytes.pm.
When a non-UTF-8 string is concatenated with a UTF-8 string,
the former is "upgraded" (converted into UTF-8) as if it were
originally in latin1.
For these reasons, you should find that $omega in the substrings
takes six bytes on 5.8.1: each of its three original bytes is
treated as a latin1 character and encoded as two bytes.
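
The upgrade can be observed with core functions only. The following is a
minimal sketch, assuming Perl 5.8 or later and a concatenation performed
outside the scope of C<use bytes;>. The variable names ($raw, $wide,
$joined, $encoded) are illustrative and do not come from the original
test script.

```perl
use strict;
use warnings;

# Three raw bytes: the UTF-8 encoding of U+2126, held as a byte string
# (UTF8 flag off), like a substring taken under "use bytes".
my $raw = "\xE2\x84\xA6";

# A character string that carries the UTF8 flag.
my $wide = "\x{100}";

# Concatenation upgrades $raw as if it were latin1.
my $joined = $raw . $wide;
my $chars  = length($joined);     # 4 characters: E2, 84, A6, U+0100

# Encode a copy to make the internal UTF-8 bytes visible.
my $encoded = $joined;
utf8::encode($encoded);
my $octets = length($encoded);    # 8 bytes: 3 * 2 for the upgraded part, 2 for U+0100

print unpack("H*", $encoded), "\n";   # prints c3a2c284c2a6c480
```

The first six bytes (c3 a2 c2 84 c2 a6) are the upgraded form of the
three original bytes, which is the six-byte $omega described above.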

In 5.6.1, the same concatenation produces malformed UTF-8 instead,
which is dangerous.

use bytes;
use Devel::Peek;
Dump("\x{100}" . "\xC5");

#Perl 5.8.1
SV = PV(0x1561cc4) at 0x155bcfc
  REFCNT = 1
  FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x168085c "\304\200\303\205"\0 [UTF8 "\x{100}\x{c5}"]
  CUR = 4
  LEN = 5

#Perl 5.6.1
SV = PV(0x1673964) at 0x1668b38
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x16836a0 "\304\200\305"\0
  CUR = 3
  LEN = 4

Here is the revised code.
In 5.8.1, all four substrings are "\xE2\x84\xA61234567" (exactly 10 bytes)!
In 5.6.1, C<use utf8;> also affects regular expressions in the block,
so there not only C<use bytes;> but also C<no utf8;> is necessary.
5.8.x is great!

#!perl

use utf8;
$omega="\x{2126}";
$str=$omega.'1234567890';

open(FILE, ">","utf8-substr.out") || die "open - $!\n";

{
    use bytes;

    $str2=substr($str,0,10);
    print FILE "substr=$str2\n\n";

    $str2b = unpack("a10", $str);
    print FILE "substr=$str2b\n\n";

    ($str2c) = ($str =~ /(..........)/);
    print FILE "substr=$str2c\n\n";

    @chars = split "", $str;
    $str2d = join("", @chars[0..9]);
    print FILE "substr=$str2d\n\n";
}

close(FILE);

__END__
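
As a possible alternative that does not rely on C<use bytes;> at all,
the string can be encoded explicitly and the substring taken from the
resulting octets. This is only a sketch, assuming Perl 5.8 or later with
the core Encode module. It is not part of the original test script, and
the names $octets and $first10 are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same data as in the script above: OHM SIGN followed by ten ASCII digits.
my $str = "\x{2126}" . '1234567890';

# Convert the character string to a byte string explicitly.
my $octets = encode('UTF-8', $str);      # 3 + 10 = 13 bytes

# substr now counts bytes because $octets is a plain byte string.
my $first10 = substr($octets, 0, 10);    # "\xE2\x84\xA6" followed by "1234567"

printf "%d bytes: %s\n", length($first10), unpack("H*", $first10);
```

Note that a cut at a fixed byte offset can still fall in the middle of
a multibyte character. Here it does not, because the only multibyte
character occupies the first three bytes.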

Regards,
SADAHIRO Tomoyuki



