[perl #79824] hash value sharing breakage
From: Zefram
Date: November 27, 2010 08:51
Subject: [perl #79824] hash value sharing breakage
Message ID: rt-3.6.HEAD-13564-1290788426-724.79824-75-0@perl.org
# New Ticket Created by Zefram
# Please include the string: [perl #79824]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=79824 >
This is a bug report for perl from zefram@fysh.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.
-----------------------------------------------------------------
[Please enter your report here]
I have a problem where I generate a hash as a munged form of another
hash and end up with the keys in the munged hash filed under the wrong
hash value. The result is that lookups by key don't find the right
entry in the munged hash. Minimal demonstration:
$ perl -MEncode -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Encode::_utf8_on($k); %h = ($k => "acme"); print $h{"L\x{e9}on"}'
Use of uninitialized value $h{"L\351on"} in print at -e line 1.
$
The key in the original hash is the UTF-8 encoding of some Unicode
string, and the resulting key in the munged hash is the Perl form of
that Unicode string. For the purposes of this bug report, my UTF-8 is
always well-formed. The necessary elements for the problem to occur are
that a key is non-ASCII but Latin-1, I get the key out of the original
hash (via keys()), and I turn its UTF-8 flag on in place. In the real
application behind this there is a reason why I'm doing _utf8_on() rather
than decode(). It should suffice for now that the UTF-8 is well-formed,
so _utf8_on() apparently produces a well-formed UTF-8 scalar.
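For comparison, here is a minimal sketch of the conventional route through Encode::decode(), which the real application does not use. It assumes only the core Encode module, and the variable names mirror the one-liner above. decode() returns a fresh character string rather than flagging the scalar that came out of keys(), so the new entry should be reachable by its character-string key.

```perl
use strict;
use warnings;
use Encode ();

# Original hash: the key is the UTF-8 encoding of "L\x{e9}on", held as bytes.
my %a = ("L\x{c3}\x{a9}on" => "acme");
my ($k) = keys %a;

# decode() builds a new character string instead of turning the UTF-8 flag
# on in place on the scalar returned by keys().
my $decoded = Encode::decode("UTF-8", $k);

my %h = ($decoded => "acme");
my $found = $h{"L\x{e9}on"};
print defined $found ? "found: $found\n" : "not found\n";
```

This is only a point of reference for the expected lookup behaviour; it does not exercise the in-place flagging that triggers the problem.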
The lookup by key at the end fails because the hash entry has the wrong
hash value, as Devel::Peek shows:
$ perl -MEncode -MDevel::Peek=Dump -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Encode::_utf8_on($k); %h = ($k => "acme"); Dump \%a; Dump \%h; Dump +{ "L\x{e9}on" => "acme" }; Dump +{ $k."" => "acme" }'
SV = RV(0x917376c) at 0x9173760
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x918b8a8
  SV = PVHV(0x91785bc) at 0x918b8a8
    REFCNT = 2
    FLAGS = (OOK,SHAREKEYS)
    ARRAY = 0x9206448 (0:7, 1:1)
    hash quality = 100.0%
    KEYS = 1
    FILL = 1
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "L\303\251on" HASH = 0x97a82c3
    SV = PV(0x91706d0) at 0x9173880
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x9186f98 "acme"\0
      CUR = 4
      LEN = 8
SV = RV(0x917376c) at 0x9173760
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x91cd410
  SV = PVHV(0x91785d0) at 0x91cd410
    REFCNT = 2
    FLAGS = (SHAREKEYS,HASKFLAGS)
    ARRAY = 0x918e9c8 (0:7, 1:1)
    hash quality = 100.0%
    KEYS = 1
    FILL = 1
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "L\303\251on" [UTF8 "L\x{e9}on"] HASH = 0x97a82c3
    SV = PV(0x9170750) at 0x9173a20
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x9183510 "acme"\0
      CUR = 4
      LEN = 8
SV = RV(0x917396c) at 0x9173960
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x9173760
  SV = PVHV(0x91785e4) at 0x9173760
    REFCNT = 1
    FLAGS = (SHAREKEYS)
    ARRAY = 0x918e9c8 (0:7, 1:1)
    hash quality = 100.0%
    KEYS = 1
    FILL = 1
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "L\351on" HASH = 0x6fc57033
    SV = PV(0x9170a30) at 0x918b868
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x9207200 "acme"\0
      CUR = 4
      LEN = 8
SV = RV(0x920ed34) at 0x920ed28
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x9173760
  SV = PVHV(0x91785e4) at 0x9173760
    REFCNT = 1
    FLAGS = (SHAREKEYS,HASKFLAGS)
    ARRAY = 0x918e9c8 (0:7, 1:1)
    hash quality = 100.0%
    KEYS = 1
    FILL = 1
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "L\303\251on" [UTF8 "L\x{e9}on"] HASH = 0x6fc57033
    SV = PV(0x9170a98) at 0x918b868
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x9205b10 "acme"\0
      CUR = 4
      LEN = 8
$
The last two hashes show that the correct hash value for "L\x{e9}on"
is 0x6fc57033, which it gets regardless of whether the key is internally
represented in Latin-1 or UTF-8. In the faulty hash (the second one
dumped), however, it's filed under hash value 0x97a82c3, which is the
correct hash value for "L\x{c3}\x{a9}on", the UTF-8 encoding as used in
the initial hash.
The best clue I have as to why this is happening comes from dumping the
key scalar:
$ perl -MEncode -MDevel::Peek=Dump -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Dump $k; Encode::_utf8_on($k); Dump $k; %h = ($k => "acme"); print $h{"L\x{e9}on"}'
SV = PV(0x877e6c0) at 0x8781920
  REFCNT = 1
  FLAGS = (POK,FAKE,READONLY,pPOK)
  PV = 0x87ace94 "L\303\251on"
  CUR = 5
  LEN = 0
SV = PV(0x877e6c0) at 0x8781920
  REFCNT = 1
  FLAGS = (POK,FAKE,READONLY,pPOK,UTF8)
  PV = 0x87ace94 "L\303\251on" [UTF8 "L\x{e9}on"]
  CUR = 5
  LEN = 0
Use of uninitialized value $h{"L\351on"} in print at -e line 1.
$
What catches my eye here is the READONLY flag. The scalar $k does not
appear to be readonly at the Perl level; I can, for example, append to it
by `$k .= "";`. In fact, doing such an append works around the problem:
$ perl -MEncode -MDevel::Peek=Dump -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Dump $k; $k.=""; Encode::_utf8_on($k); Dump $k; %h = ($k => "acme"); print $h{"L\x{e9}on"}'
SV = PV(0x9c266c0) at 0x9c29920
  REFCNT = 1
  FLAGS = (POK,FAKE,READONLY,pPOK)
  PV = 0x9c83d5c "L\303\251on"
  CUR = 5
  LEN = 0
SV = PV(0x9c266c0) at 0x9c29920
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9c39510 "L\303\251on"\0 [UTF8 "L\x{e9}on"]
  CUR = 5
  LEN = 8
acme
$
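The workaround can be wrapped up as a small helper. This is a sketch only: the name utf8_flag_copy is made up for illustration and is not part of Encode. It relies on the observation above that appending an empty string gives the scalar its own buffer before the flag is turned on, and it assumes the caller guarantees well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): return a copy of $bytes with the
# UTF-8 flag on. The caller must guarantee that $bytes is well-formed UTF-8.
sub utf8_flag_copy {
    my ($bytes) = @_;
    my $copy = $bytes;
    $copy .= "";                # force a private buffer, as in the workaround
    Encode::_utf8_on($copy);    # flag the private copy, not the shared key
    return $copy;
}

my %a = ("L\x{c3}\x{a9}on" => "acme");

# Build the munged hash from flagged copies of the original keys.
my %h;
for my $key (keys %a) {
    $h{ utf8_flag_copy($key) } = $a{$key};
}

my $found = $h{"L\x{e9}on"};
print defined $found ? "found: $found\n" : "not found\n";
```

Whether the extra copy is acceptable depends on the reason for avoiding decode() in the first place; the sketch only shows that the flagging itself is not what breaks the lookup.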
I'm guessing (but haven't checked in the core source yet) that the
PV is being shared, and the hash value is being cached, keyed by the
PV address, which runs into trouble when the same PV represents two
different character sequences.
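One way to probe this guess from Perl space, without reading the core source, is to watch the LEN field of the key scalar, since LEN = 0 in the dumps above suggests the scalar does not own its buffer. A minimal sketch using the core B module follows; the helper name describe is made up, and the values reported before the append differ between perl versions, so no particular output is claimed.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report the fields of interest from the Devel::Peek
# dumps (CUR, LEN, and the UTF-8 flag) for the scalar behind a reference.
sub describe {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return sprintf "CUR=%d LEN=%d UTF8=%s",
        $sv->CUR, $sv->LEN, (utf8::is_utf8($$ref) ? "yes" : "no");
}

my %a = ("L\x{c3}\x{a9}on" => "acme");
my ($k) = keys %a;

# Straight out of keys(): the scalar may still point at the shared key.
my $before = describe(\$k);

# After the append the scalar should have a buffer of its own.
$k .= "";
my $after = describe(\$k);

print "before append: $before\n";
print "after append:  $after\n";
```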
Now, another complication. My real code behind this actually isn't
calling Encode::_utf8_on(). It's XS code, and it's doing SvUTF8_on()
on an sv_mortalcopy() of the original key. If I change sv_mortalcopy()
to sv_2mortal(newSVsv()) then the problem goes away. This seems strange,
since both sv_mortalcopy() and newSVsv() claim (in perlapi(1)) to perform
their copying via sv_setsv(). Some sv_dump()ing around these operations
shows the same varying behaviour of the READONLY flag.
So, what are the rules about sharing PVs between scalars, and which
entities in this mess are infringing them?
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=medium
---
Site configuration information for perl 5.10.0:
Configured by Debian Project at Fri Aug 28 22:30:10 UTC 2009.
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
Platform:
osname=linux, osvers=2.6.26-2-amd64, archname=i486-linux-gnu-thread-multi
uname='linux puccini 2.6.26-2-amd64 #1 smp fri aug 14 07:12:04 utc 2009 i686 gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -g',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
ccversion='', gccversion='4.3.2', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib /usr/lib64
libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
gnulibc_version='2.7'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'
Locally applied patches:
---
@INC for perl 5.10.0:
/etc/perl
/usr/local/lib/perl/5.10.0
/usr/local/share/perl/5.10.0
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.10
/usr/share/perl/5.10
/usr/local/lib/site_perl
.
---
Environment for perl 5.10.0:
HOME=/home/zefram
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/zefram/usr/perl/util:/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
PERL_BADLANG (unset)
SHELL=/usr/bin/zsh