[perl #114410] Substr giving wrong results on $1 with utf8
From: Father Chrysostomos via RT
Date: August 30, 2012 13:24
Subject: [perl #114410] Substr giving wrong results on $1 with utf8
Message ID: rt-3.6.HEAD-11172-1346358278-1043.114410-15-0@perl.org
On Mon Aug 06 23:49:16 2012, choroba@matfyz.cz wrote:
> Running substr($1, 0, 1) gives strange results when matching a utf8
> string read from STDIN: sometimes, the string is longer, sometimes, it
> contains the EF-BF-BD replacement character.
>
> The following code should demonstrate the problem:
...
It doesn’t have to come from input. Here is a simpler example:
"\x{100}" =~ /(.+)/;
substr $1, 0, 1;
"a\x{100}" =~ /(.+)/;
warn ord substr $1, 1, 1;
And here is the output in various perl versions (5.17.3 is actually
v5.17.2-146-g7e2a0d4 and 5.17.4 is v5.17.3-139-g61dad97):
$ pbpaste|perl5.8.1
256 at - line 4.
$ pbpaste|perl5.8.9
256 at - line 4.
$ pbpaste|perl5.10.0
Malformed UTF-8 character (unexpected continuation byte 0x80, with no
preceding start byte) in ord at - line 4.
0 at - line 4.
$ pbpaste|perl5.10.1
Malformed UTF-8 character (unexpected continuation byte 0x80, with no
preceding start byte) in ord at - line 4.
0 at - line 4.
$ pbpaste|perl5.14.1
0 at - line 4.
$ pbpaste|perl5.16.0
0 at - line 4.
$ pbpaste|perl5.17.3
0 at - line 4.
$ pbpaste|perl5.17.4
Wide character in substr at - line 4.
panic: sv_pos_u2b_cache cache 3 real 1 for aĀ at - line 4.
What’s happening is that the UTF-8 length/position cache goes stale and
is never reset.
The first substr caches character-to-byte offset information on $1. The
second pattern match changes the content of $1 (actually changes what $1
points to underneath; $1’s contents are not updated until it is read).
The second substr reuses the cache that is still there.
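For anyone hitting this in the meantime, a minimal sketch of a workaround, assuming that copying $1 into an ordinary scalar gives substr a value with no stale cache attached. The variable names are mine, not from the report, and I have not checked this on every affected version:

```perl
use strict;
use warnings;

# Run the same two matches as in the example above, but call substr on a
# plain copy of $1 rather than on $1 itself.
my @code_points;
for my $string ("\x{100}", "a\x{100}") {
    next unless $string =~ /(.+)/;
    my $copy = $1;                       # ordinary scalar, no capture magic
    my $last = length($copy) - 1;        # index of the final character
    push @code_points, ord(substr($copy, $last, 1));
}
print "@code_points\n";
```

The last character of both strings is U+0100, so this should print "256 256".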
I suspect we need to rethink the way the magic mechanism interacts with
utf8 caching.
It affects tied variables as well:
$y = "a\x{100}";
sub TIESCALAR{bless[]}
sub FETCH{$y}
tie $x, "";
warn ord substr $x, 0, 1;
$y = "\x{100}";
warn ord substr $x, 0, 1;
__END__
1 at - line 5.
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc4) in ord at - line 7.
0 at - line 7.
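A more spelled-out version of the tied case, written as a sketch that could be turned into a regression check. The package name StaleCacheDemo::Scalar and the variable names are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical tie class, for illustration only: FETCH returns whatever
# the referenced scalar currently holds.
package StaleCacheDemo::Scalar;
sub TIESCALAR { my ($class, $ref) = @_; return bless { ref => $ref }, $class }
sub FETCH     { return ${ $_[0]{ref} } }

package main;

my $backing = "a\x{100}";
tie my $tied, 'StaleCacheDemo::Scalar', \$backing;

my @seen;
push @seen, ord(substr($tied, 0, 1));    # first character is "a"
$backing = "\x{100}";                    # change what FETCH will return
push @seen, ord(substr($tied, 0, 1));    # first character is now U+0100
print "@seen\n";
```

On a perl without the stale-cache problem this prints "97 256"; on an affected perl the second value is expected to come out wrong, as in the output above.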
And substr lvalues:
$x = "a\x{100}";
$l = \substr $x, 0;
warn ord substr $$l, 1, 1;
substr $x, 0, 1, "\x{100}";
warn ord substr $$l, 1, 1;
__END__
256 at - line 3.
Wide character in substr at - line 5.
panic: sv_pos_u2b_cache cache 4 real 2 for ĀĀ at - line 5.
And nonexistent hash elements:
sub {
$_[0] = "a\x{100}";
warn ord substr $_[0], 1, 1;
$h{k} = "\x{100}"x2;
warn ord substr $_[0], 1, 1;
}->($h{k});
__END__
256 at - line 3.
0 at - line 5.
--
Father Chrysostomos
---
via perlbug: queue: perl5 status: new
https://rt.perl.org:443/rt3/Ticket/Display.html?id=114410