develooper Front page | perl.perl5.porters | Postings from November 2003

Re: [perl #24541] substr and utf8 and use bytes

November 22, 2003 01:42

On 22 Nov 2003 02:43:38 -0000
William R Ward (via RT) wrote:

> We have a need to take a string containing utf8-encoded multibyte
> characters, and then, treating the string as bytes, extract a
> substring of N characters from it.
> This is what "use bytes" was meant for, and it works great on Perl
> 5.6.1.  But in Perl 5.8.1 it corrupts the multi-byte characters in the
> process of extracting them.  The following test script illustrates the
> issue well.  Run it under 5.6.1 and 5.8.1 and check the difference.

Thank you for your report.
But the line
> eval ($] > 5.008) && binmode(FILE, ":utf8");
is clearly unsuitable for a comparison between 5.6.1 and 5.8.1,
since this line takes effect only under 5.8.1.

Another snag is the string concatenation done by double-quoted strings.
(cf. "after use bytes: str=$str, substr=$str2\n\n")

In 5.8.x, string concatenation is not affected by C<use bytes;>.
When a non-UTF-8 string is concatenated with a UTF-8 string,
the former is "upgraded" (converted into UTF-8) as if it were
originally in latin1.
For these reasons, you should find that $omega in the substrings
occupies six bytes on 5.8.1: each of its three bytes is upgraded
to a two-byte sequence.
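
The upgrade described above can be reproduced with a short sketch. It assumes that $omega held the three UTF-8 bytes of U+2126 as a plain byte string; the variable names below are illustrative and are not taken from the original test script.

```perl
use strict;
use warnings;

# Assumption: the three bytes of the UTF-8 encoding of U+2126,
# stored as a plain byte string (UTF8 flag off).
my $omega_bytes = "\xE2\x84\xA6";

# A character string with the UTF8 flag on.
my $wide = "\x{100}";

# Concatenation upgrades $omega_bytes as if it were latin1:
# every byte >= 0x80 becomes a two-byte UTF-8 sequence.
my $joined = $wide . $omega_bytes;

# Drop the leading U+0100; what remains is the upgraded $omega_bytes.
my $tail = substr($joined, 1);

# Count the bytes of the internal representation.
my $nbytes = do { use bytes; length($tail) };

printf "characters: %d, bytes: %d\n", length($tail), $nbytes;
# prints "characters: 3, bytes: 6" on 5.8.x and later
```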

In 5.6.1, malformed UTF-8 is produced instead, which is dangerous.

use bytes;
use Devel::Peek;
Dump("\x{100}" . "\xC5");

#Perl 5.8.1
SV = PV(0x1561cc4) at 0x155bcfc
  REFCNT = 1
  PV = 0x168085c "\304\200\303\205"\0 [UTF8 "\x{100}\x{c5}"]
  CUR = 4
  LEN = 5

#Perl 5.6.1
SV = PV(0x1673964) at 0x1668b38
  REFCNT = 1
  PV = 0x16836a0 "\304\200\305"\0
  CUR = 3
  LEN = 4
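
The 5.8.1 result can also be checked without Devel::Peek. This is only a sketch for 5.8.x and later, using utf8::is_utf8 and utf8::encode on a copy of the string; it does not reproduce the 5.6.1 behaviour.

```perl
use strict;
use warnings;

# Same concatenation as in the Dump() call above.
my $s = "\x{100}" . "\xC5";

# 1 if the UTF8 flag is on, 0 otherwise.
my $flag = utf8::is_utf8($s) ? 1 : 0;

# Encode a copy to see the octets that make up the string.
my $octets = $s;
utf8::encode($octets);
my $hex = unpack("H*", $octets);

print "UTF8 flag: $flag, octets: $hex\n";
# prints "UTF8 flag: 1, octets: c480c385"
# (c4 80 c3 85 is "\304\200\303\205" in the 5.8.1 dump)
```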

Here is the revised code.
In 5.8.1, all substrings are "\xE2\x84\xA61234567" (exactly 10 bytes)!
In 5.6.1, C<use utf8;> affects regular expressions in the block,
so there both C<use bytes;> and C<no utf8;> are necessary.
5.8.x is great!


use utf8;

open(FILE, ">","utf8-substr.out") || die "open - $!\n";

    use bytes;

    print FILE "substr=$str2\n\n";

    $str2b = unpack("a10", $str);
    print FILE "substr=$str2b\n\n";

    ($str2c) = ($str =~ /(..........)/);
    print FILE "substr=$str2c\n\n";

    @chars = split "", $str;
    $str2d = join("", @chars[0..9]);
    print FILE "substr=$str2d\n\n";
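
For comparison, here is a self-contained sketch of the same four extractions (substr, unpack, regular expression, split/join). It is a substitution, not the code from this message: rather than relying on C<use bytes;>, it encodes the string to octets first with utf8::encode and then works on that byte string. The value of $str is an assumption inferred from the expected result "\xE2\x84\xA61234567", and the variable names are illustrative.

```perl
use strict;
use warnings;

# Assumed input: U+2126 (OHM SIGN) followed by ten ASCII digits.
my $str = "\x{2126}1234567890";

# Work on an explicit copy of the UTF-8 octets.
my $octets = $str;
utf8::encode($octets);              # 3 + 10 = 13 bytes, UTF8 flag off

# Four ways to take the first 10 bytes.
my $s_substr  = substr($octets, 0, 10);
my $s_unpack  = unpack("a10", $octets);
my ($s_regex) = $octets =~ /\A(.{10})/s;
my $s_split   = join "", (split //, $octets)[0 .. 9];

for my $piece ($s_substr, $s_unpack, $s_regex, $s_split) {
    printf "%d bytes: %s\n", length($piece), unpack("H*", $piece);
}
# each line prints "10 bytes: e284a631323334353637"
```

Because $octets is an ordinary byte string, every operation here counts bytes regardless of which pragmas are in effect. When writing such a string to a file, the handle should not have a :utf8 layer, otherwise the bytes are upgraded again on output.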



SADAHIRO Tomoyuki