develooper Front page | perl.perl5.porters | Postings from August 2013

Re: [perl #119089] Shared references (threads::shared) disappear onsub return

Thread Previous | Thread Next
From:
Nicholas Clark
Date:
August 5, 2013 14:19
Subject:
Re: [perl #119089] Shared references (threads::shared) disappear onsub return
Message ID:
20130805141944.GQ4940@plum.flirble.org
On Wed, Jul 31, 2013 at 01:10:33PM +0100, Dave Mitchell wrote:
> On Tue, Jul 30, 2013 at 10:58:36PM -0700, Alexey Borzenkov wrote:

> > When using threads::shared on Perl 5.14 using many kinds of
> > expressions in sub's return becomes very dangerous, as return
> > values just become undef undef certain circumstances. For example
> > this program would die in Perl 5.14, but would work normally in
> > Perl 5.12 or Perl 5.16:
> > 
> >     use strict;
> >     use threads ();
> >     use threads::shared;
> > 
> >     sub test_clone {
> >         my $ref = shared_clone([{a => 1, b => 2}]);
> >         return $ref->[0];
> >     }
> > 
> >     die "not defined" unless defined(test_clone);

This can be simplified to:

#!./perl
use strict;
use threads ();
use threads::shared;

sub test_clone {
    my @a :shared;
    $a[0] = 6;
    $a[0];
}

die "not defined" unless defined test_clone;
__END__

> > This causes threaded perl programs on Debian 7.0 or Ubuntu 12.04
> > to produce really weird failures.
> 
> This was fixed in perl 5.15.7 with commit
>     6f48390ab209d16ee8f795f0a83677c8bd9ac69c,
> which wont be back-ported to 5.14.x as its out of its maintenance period.
> 
> You can work around the issue on older perls by assigning the value to
> a temporary var before returning it; i.e.
> 
>     my $ret = $ref->[0];
>     return $ret;

Totally true.

If assertions are enabled the test case triggers an assertion failure on
5.14.0. Given this, I was suspicious as to what introduced the assertion
failure, it turns out that there's a heck of a lot more to this than first
meets the eye.

So a git bisect (with assertions enabled) shows that commit 6f48390ab209d16e
(as mentioned above) makes the test case work. It *also* makes the test case
work if assertions are disabled.

commit 6f48390ab209d16ee8f795f0a83677c8bd9ac69c
Author: Father Chrysostomos <sprout@cpan.org>
Date:   Wed Jan 4 23:28:54 2012 -0800

    [perl #95548] Returned magical temps are not copied
    
    return and leavesub, for speed, were not copying temp variables with a
    refcount of 1, which is fine as long as the fact that it was not cop-
    ied is not observable.
    
    With magical variables, that *can* be observed, so we have to forego
    the optimisation and copy the variable if's magical.
    
    This obviously applies only to rvalue subs.



So, what caused the assertions to start failing between v5.12 and v5.14?
A simple bisect suggests that it's this commit:


commit f2338a2e8347fc967ab6b9af21d948258b88e341
Author: David Mitchell <davem@iabyn.com>
Date:   Thu Apr 15 10:20:50 2010 +0100

    use cBOOL for bool casts
    
    bool b = (bool)some_int
    
    doesn't necessarily do what you think. In some builds, bool is defined as
    char, and that cast's behaviour is thus undefined. So this line in mg.c:
    
        const bool was_temp = (bool)SvTEMP(sv);
    
    was actually setting was_temp to false even when the SVs_TEMP flag was set.
    Fix this by replacing all the (bool) casts with a new cBOOL() cast macro
    that (hopefully) does the right thing.




which has nothing directly to do with threads shared, or (not?) copying
values on subroutine return. It turns out after a bit of digging that it's
this section of that change that matters:

diff --git a/mg.c b/mg.c
index 3fb8ec4..4a8d767 100644
--- a/mg.c
+++ b/mg.c
@@ -193,7 +193,7 @@ Perl_mg_get(pTHX_ SV *sv)
 {
     dVAR;
     const I32 mgs_ix = SSNEW(sizeof(MGS));
-    const bool was_temp = (bool)SvTEMP(sv);
+    const bool was_temp = cBOOL(SvTEMP(sv));
     bool have_new = 0;
     MAGIC *newmg, *head, *cur, *mg;
     /* guard against sv having being freed midway by holding a private



It's not obvious from the code above, but SvTEMP(sv) returns either 0 or
0x00080000, and in 2010, C<bool> was typedef'd as C<char>, so the above code
always set was_temp to 0, due to truncation. The cast was added to avoid a
warning about the truncation. The (buggy) truncation itself was added when
the type of was_temp was changed from a suitably large integer to a bool as
part of this commit:

commit 35a4481cfdbca4941ab3a4206dc266f3e71c2385
Author: Andy Lester <andy@petdance.com>
Date:   Sun Mar 13 08:20:05 2005 -0600

    Adding const qualifiers
    Message-ID: <20050313202005.GA23535@petdance.com>
    
    p4raw-id: //depot/perl@24037


(note that the commit message doesn't even note that some variables' types
were changed).


The variable in question is used to enable Perl_mg_get() to retain the
SVs_TEMP bit set on values it is processing. This is needed because
Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets
the SVs_TEMP flag bit on everything. Hence this code seeks to remove
SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that
SVs_TEMP was always being removed, so code in subroutine entry that does
*not* copy TEMPs is skipped because the value is becoming unmarked. Hence
the values are always copied.


So, all that conceals what actually going on. When that truncation bug was
first introduced, it didn't matter that TEMPs weren't copied. When that bug
was fixed, it mattered a lot that TEMPs weren't copied. What changed in the
middle?


I set out to bisect between the two commits, patching *that* the bug to fix
it. The code doesn't change much, and test-building on the change points
reveals that at the revision with the last change to that part of the code
(changing a line adjacent to C<was_temp>), fixing *that* bug doesn't break
the test case.

So I made a patch to fix the bug, and bisected with that:

Porting/bisect.pl -Dusethreads --start f9c6fee519b868a2e8ef8c5b701c0d3f95565423 --end db63319f533e643ef6aac622fcae9a2f7ceabb0d --late-fixup ../119089-W -Dnoextensions=Encode ../perl/119089


That reveals that the true trigger of the behaviour regression was this
change:

commit fd69380d5d5b95ef16e2521cf4251b34ee0ce151
Author: David Mitchell <davem@iabyn.com>
Date:   Tue Mar 23 12:11:43 2010 +0000

    Fix assorted bugs related to magic (such as pos) not "sticking" to
    magical array and hash elements; e.g. the following looped infinitely:
    
        $h{tainted_element} =~ /..../g
    
    There are two side-effects of this fix.
    First, MGf_GSKIP has been extended to work on tied array
    elements as well as hash elements. This is the mechanism that skips all
    but the first tied element magic gets until after the next set.
    Second, rvalue hash/array element access where the element has get magic,
    now directly returns the element rather than a mortal copy.
    
    The root cause of the bug was code similar to the following in pp_alem,
    pp_aelemfast, pp_helem and pp_rv2av:
    
        if (!lval && SvGMAGICAL(sv))	/* see note in pp_helem() */
    	sv = sv_mortalcopy(sv);
    
    According to the note, this was added in 1998 to make this work:
    
        local $tied{foo} = $tied{foo}
    
    Since it returns a copy rather than the element, this make //g fail.
    My first attempt, a few years ago,  to fix this, took the approach that
    the LHS of the bind should be made an lvalue in the presence of //g, since
    it now modifies its LHS; i.e.
    
        expr =~ //        expr is rvalue
        expr =~ s///      expr is lvalue
        expr =~ //g       expr was rvalue, I proposed to change it to lvalue
    
    Unfortunately this fix broke too much stuff (stuff that was arguably
    already broken, but it upset people). For example, f() ~= s////
    correctly gives the error
    
        Can't modify non-lvalue subroutine call
    
    My fix extended f() =~ //g to give the same error. Which is reasonable,
    because the g isn't doing what you want. But plenty of people had code that
    only needed to match once and the g had just been cargo-culted. So it
    broke their working code. So lets not do this.
    
    My new approach has been to remove the sv_mortalcopy(). It turns out
    that this is no longer needed to fix the local $tied{foo} issue.
    Presumably that went away as a side-effect of my container/value magic
    localisation rationalisation of a few years ago, although I haven't
    analysed it - just noted that the tests still pass (!). However, an issue
    with removing it is that mg_get() no longer gets called. So a plain
    
        $tied_hash{elem};
    
    in void context no longer calls FETCH(). Which broke some tests and might
    break some code. Also, there's an issue with the delayed calling of magic
    in @+[n] and %+{foo}; by the time the get magic is called, the original
    pattern may have gone out of scope.
    
    The solution is to simply replace the original
    
    	sv = sv_mortalcopy(sv);
    
    with
    
    	mg_get(sv);
    
    This then caused problems with tied array FETCH() getting called too much.
    I fixed this by extending the MGf_GSKIP mechanism to tied arrays as well
    as hashes. I don't understand why tied arrays have always been treated
    differently than tied hashes, but unifying them didn't seem to break
    anything (except for a Storable test, whose comment indicated that the
    test's author thought FETCH() was being called to often anyway).



The key part is this:


diff --git a/pp_hot.c b/pp_hot.c
index 3371e88..8f8af53 100644
--- a/pp_hot.c
+++ b/pp_hot.c
@@ -658,7 +658,7 @@ PP(pp_aelemfast)
     SV *sv = (svp ? *svp : &PL_sv_undef);
     EXTEND(SP, 1);
     if (!lval && SvGMAGICAL(sv))	/* see note in pp_helem() */
-	sv = sv_mortalcopy(sv);
+	mg_get(sv);
     PUSHs(sv);
     RETURN;
 }



ie it *gets rid of a copy*. So this removes one place that copied the return
value, and the cast fix removes the other place, hence why both are needed
to trigger the problem. And why the identified patch fixes it, by
re-instating a copy.


The story is not over.


So, I assumed that taking blead, and reverting the patch that fixes the test
case, would cause the test case to fail again. It doesn't. More bisecting
revealed that subsequently this change also has an influence:

commit 4bac9ae47b5ad7845a24e26b0e95609805de688a
Author: Chip Salzenberg <chip@pobox.com>
Date:   Fri Jun 22 15:18:18 2012 -0700

    Magic flags harmonization.
    
    In restore_magic(), which is called after any magic processing, all of
    the public OK flags have been shifted into the private OK flags.  Thus
    the lack of an appropriate public OK flags was used to trigger both get
    magic and required conversions.  This scheme did not cover ROK, however,
    so all properly written code had to make sure mg_get was called the right
    number of times anyway.  Meanwhile the private OK flags gained a second
    purpose of marking converted but non-authoritative values (e.g. the IV
    conversion of an NV), and the inadequate flag shift mechanic broke this
    in some cases.
    
    This patch removes the shift mechanic for magic flags, thus exposing (and
    fixing) some improper usage of magic SVs in which mg_get() was not called
    correctly.  It also has the side effect of making magic get functions
    specifically set their SVs to undef if that is desired, as the new behavior
    of empty get functions is to leave the value unchanged.  This is a feature,
    as now get magic that does not modify its value, e.g. tainting, does not
    have to be special cased.
    
    The changes to cpan/ here are only temporary, for development only, to
    keep blead working until upstream applies them (or something like them).
    
    Thanks to Rik and Father C for review input.




because part of it removes the entire was_temp (micro?) optimisation logic:


@@ -3302,12 +3285,8 @@ S_restore_magic(pTHX_ const void *p)
 	       So artificially keep it alive a bit longer.
 	       We avoid turning on the TEMP flag, which can cause the SV's
 	       buffer to get stolen (and maybe other stuff). */
-	    int was_temp = SvTEMP(sv);
 	    sv_2mortal(sv);
-	    if (!was_temp) {
-		SvTEMP_off(sv);
-	    }
-	    SvOK_off(sv);
+	    SvTEMP_off(sv);
 	}
 	else
 	    SvREFCNT_dec(sv); /* undo the inc in S_save_magic() */


meaning that values will always be copied.

(Given the nature of the first commit identified as a bug fix, this seems
reasonable. magic values shouldn't be being short circuited.)


So that explains the given test case. But the story is still not over.

Whilst temporary values should be copied (by default), LVALUEs returned from
:lvalue subs should not. This will fail assertions, SEGV, or worse even on
blead:

#!./perl
use strict;
use threads ();
use threads::shared;
use Devel::Peek;

sub test_clone :lvalue {
    my @a :shared;
    $a[0];
}

#Dump(test_clone);
test_clone() = 0;
__END__


I believe that the problem is because threads::shared wrongly assumes that
the proxy for a shared aggregate will always outlive any LVALUE proxies for
its elements. This is generally true, but not for element proxies returned
from :lvalue subroutines for aggregate proxies which go out of scope with
the subroutine.

I believe that the fix is to change threads::shared to deal with cleanup
differently, so that the last magic struct which frees up the relevant
mg_obj is responsible for cleaning up shared space. This means adding "free"
magic to the element proxies:


diff --git a/dist/threads-shared/shared.xs b/dist/threads-shared/shared.xs
index d3e859d..07379cc 100644
--- a/dist/threads-shared/shared.xs
+++ b/dist/threads-shared/shared.xs
@@ -1027,6 +1027,27 @@ sharedsv_elem_mg_DELETE(pTHX_ SV *sv, MAGIC *mg)
     return (0);
 }
 
+int
+sharedsv_elem_mg_free(pTHX_ SV *sv, MAGIC *mg)
+{
+    dTHXc;
+    PERL_UNUSED_ARG(sv);
+    ENTER_LOCK;
+    if (mg->mg_obj) {
+        if (!PL_dirty) {
+            assert(SvROK(mg->mg_obj));
+        }
+        if (SvREFCNT(mg->mg_obj) == 1) {
+            /* If the element has the last pointer to the shared aggregate, then
+               it has to free the shared aggregate.  mg->mg_obj itself is freed
+               by Perl_mg_free()  */
+            S_sharedsv_dec(aTHX_ SHAREDSV_FROM_OBJ(mg->mg_obj));
+        }
+    }
+    LEAVE_LOCK;
+    return (0);
+}
+
 /* Called during cloning of PERL_MAGIC_tiedelem(p) magic in new
  * thread */
 
@@ -1044,7 +1065,7 @@ MGVTBL sharedsv_elem_vtbl = {
     sharedsv_elem_mg_STORE,     /* set */
     0,                          /* len */
     sharedsv_elem_mg_DELETE,    /* clear */
-    0,                          /* free */
+    sharedsv_elem_mg_free,      /* free */
     0,                          /* copy */
     sharedsv_elem_mg_dup,       /* dup */
 #ifdef MGf_LOCAL
@@ -1114,7 +1135,21 @@ int
 sharedsv_array_mg_free(pTHX_ SV *sv, MAGIC *mg)
 {
     PERL_UNUSED_ARG(sv);
-    S_sharedsv_dec(aTHX_ (SV*)mg->mg_ptr);
+    if (!PL_dirty) {
+        assert(mg->mg_obj);
+        assert(SvROK(mg->mg_obj));
+        assert(SvUV(SvRV(mg->mg_obj)) == PTR2UV(mg->mg_ptr));
+    }
+    if (mg->mg_obj) {
+        if (SvREFCNT(mg->mg_obj) == 1) {
+            S_sharedsv_dec(aTHX_ (SV*)mg->mg_ptr);
+        } else {
+            /* An element of this aggregate still has PERL_MAGIC_tied(p)
+               pointing to this shared aggregate.  It will take responsibility
+               for freeing the shared aggregate.  Perl_mg_free() drops the
+               reference count on mg->mg_obj.  */
+        }
+    }
     return (0);
 }


This change has been stewing on the branch smoke-me/nicholas/RT119089-variant
and doesn't seem to have any problems. (The commit message is now not correct
- 9b9aaeae0f62 fixed the problem on 5.12.0. Global destruction*)

I think that it doesn't introduce any leaks, but I'd appreciate a second pair
of eyes.

This fix to threads::shared has the side effect of also fixing the original
reported bug when running on v5.14

Nicholas Clark

* Global destruction. threads::shared does not clean up "shared space", even
  when PL_destruct_level is 2. There is no C<perl_destruct> corresponding to
  the C<perl_alloc> in Perl_sharedsv_init(). I'm about to report that as a
  separate bug.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About