develooper Front page | perl.perl5.porters | Postings from October 2014

Re: sv_grow() and malloc()

Thread Previous | Thread Next
From:
Dave Mitchell
Date:
October 7, 2014 16:20
Subject:
Re: sv_grow() and malloc()
Message ID:
20141007162010.GR5204@iabyn.com
On Thu, Sep 25, 2014 at 09:12:40AM +0200, demerphq wrote:
> On 23 September 2014 13:49, Dave Mitchell <davem@iabyn.com> wrote:
> 
> > (this is spin-off from Reini's suggested work to tweak sv_grow()
> > allocations).
> >
> > I've been thinking about about all our heuristics for growing (and
> > over-growing) strings in sv_grow(), and how it's likely to interact with
> > the OS's malloc() library.
> >
> 
> At least one OS (sun maybe?) supports an introspection
> routine Perl_safesysmalloc_size which exposes the canonical form of this
> information.
> 
> Use of this routine in sv.c caused problems with the new COW logic and the
> readline stuff so I disabled it. It can be re-enabled by defining:
> 
> PERL_UNWARANTED_CHUMMINESS_WITH_MALLOC
> 
> I do not recommend doing this, it confuses the COW logic quite a bit.


I've spent the last week or two (on and off) looking into how all the
malloc and COW stuff interacts.

TL;DR: COW caused problems with memory wastage when using readline, so
sv_setsv_flags() was modified to skip COW if SvLEN was quite bigger than
SvCUR; this then caused problems on platforms with malloc_size() (Darwin
and perl's malloc as far as I know), so calling malloc_size() was
disabled. I believe that the SvCUR verses SvLEN test is sub-optimal in
some cases, and can be tweaked, after which malloc_size() can be
re-enabled. Further, I believe a run-time startup probe of malloc() and
realloc() behaviour on platforms without malloc_size() can provide much of
the benefit of malloc_size().


Background:

There are two core functions I'm interested in here: sv_setsv_flags(),
which for strings, decides when to copy, and when to COW; and sv_grow(),
which mallocs or reallocs the string buffer when the SV is first used as a
string, and when it is subsequently needed to hold a bigger string.

sv_grow() applies various fiddle factors to the requested length; these
come in two classes; first those where we think we're smarter than the
caller (e.g. sometimes over-allocating by typically 25% to avoid lots of
reallocs if the string is being repeatedly appended to), and second where
we start being chummy with the malloc library. The first class  I think
can be improved, but I'm going to leave that discussion for another time.

As regards chumminess, some platforms have a malloc_size() function which
returns, for a malloc()ed address, the real size allocated, which is likely
to be larger than the requested size due to alignment, implementation and
efficiency considerations. Where available, perl uses this function to
post-hoc adjust SvLEN. On platforms without malloc_size(), we instead
round the request up to sizeof(size_t), on the grounds that the malloc
library is likely to do the same. So overall, sv_grow(sv, len) looks
something like:

    len += adjust for various adjusts;

    #ifdef Perl_safesysmalloc_size
        s = malloc(len);
        SvPV_set(sv, s);
        SvLEN_set(sv, Perl_safesysmalloc_size(s));
    #else
        len = round_up(len);
        SvPV_set(sv, malloc(len));
        SvLEN_set(sv, len);
    #endif

The other function of interest, sv_setsv_flags() (which is the thing
called for perl-level scalar assignment), has a set of criteria to decide
when to copy the string, when to steal it, and when to share it (COW). For
example, if the string is relatively short and the destination string
already has a buffer big enough to hold it, then it's cheaper to just copy
the string rather than set up COW.


The issues:

A while back we had an issue with COW and readline() interacting:
code like this:

    push @a, $_ while (<>);

was causing each pushed SV to be large, even if the lines were short. This
was due to sv_gets() initially allocating a large buffer to read the first
line; rather than a copy being pushed, the original large SV is COWed and
pushed; when readline read the second line, it noticed that the buffer was
shared, abandoned it, and created a new (large) buffer. It's generally
agreed that the whole readline()/sv_gets() buffer allocation strategy
needs rethinking anyway, but in the meantime, and also to help avoid
similar situations elsewhere, Yves added to 5.20.0, with commit
e8c6a474e88610b7, some extra criteria to sv_setsv_flags() on when to skip
COW: namely when the src's SvLEN is quite a bit bigger than its SvCUR,
(for some definition of "quite".) This solved the readline() issue.

This subsequently caused an issue on Darwin (and any platform that has
malloc_size()):

    [perl #121975] COW speedup lost after e8c6a474

In that ticket, something like:

    $x = "x" x 1_000_000;
    $y = $x for 1..1_000_000;

was slow (took minutes) up to and including 5.18.0; it then became fast
(< 1 sec) when COW was enabled; it became slow again when the SvCUR verses
SvLEN checks were added. This was because perl was doing malloc(1000002),
and Darwin's malloc() library was rounding this to a nice block, 1003520
(0xf4242 => 0xf5000). When we set SvLEN to malloc_size(), we ended up with
SvCUR = 100000, SvLEN = 1003520; the difference between the two was enough
to trigger the "skip COW" test. Yves worked round this by disabling calls
to malloc_size() by default, with ce861ea796308:

    temporary fix for [perl #121975] COW speedup lost after e8c6a474


What I propose:

I think that the criteria for skipping COW on divergent SvCUR/SvLEN
need tweaking; I'll discuss this further below. Once that's fixed,
I think it will be safe to revert ce861ea796308, and platforms that have
malloc_size() will again benefit from this slight optimisation.

Further, based on the outputs of that little program I asked people
to run on various platforms, I think it is possible at perl startup time,
to do a probe consisting approximately of

    malloc(1); realloc($_) for 2..N
    
which will allow us to determine two constants A and B such that the size
in any call to malloc() or realloc() in sv_grow() can be rounded up to to
A+nB for some n with no extra memory wasted (for example on 64-bit linux,
A=24, B=16). This would replace the existing "round up to sizeof(size_t)"
on non-malloc_size() platforms (equivalent to A=8,B=8). We'd continue to
use malloc_size() on platforms that support it. It may also be the case
that there are other platform-specific malloc() introspection functions
that we could add Configure probes for and use where available. But I'll
want to think further about the A+nB and probing issues and discuss this
later.

Finally, there are issues with sv_grow()'s "I know better than the caller"
over-allocation strategies; which I want to fix, but again I'll discuss
this further later.

What now:

So the main thing I want to do right now is tweak the SvCUR verses SvLEN
testing, then revert the malloc_size() disabling. The three tables below
show how sv_setsv_flags() decides whether to steal, copy or COW based on
various values of SvCUR of the source string, plus various other criteria
(some of which I have not mentioned since they are irrelevant to this
discussion, such as the cow refcount being > SV_COW_REFCNT_MAX).

The main action shown in each table is that which prevailed before the
e8c6a474e886 additional "SvLEN >> SvCUR" restrictions; the comments in
[square brackets] describe the additional restrictions following that
commit, while the numbers in {} are examples of my proposed changes to the
criteria (see later more details).

The main pre-patch criterion was to copy rather than COW if the string was
short (< 1250 bytes) and would already fit in the destination buffer (and
the buffer already existed); the added post-patch criteria were to skip
COW (and thus favour copy) if SvLEN > 2*SvCUR or SvLEN > SvCUR + 80.


First, test whether the buffer is swipeable:

    just steal the buffer from the src SV if:
        src not COW, RC==1, and various other constraints,
        and the SV is either SVs_TEMP or SVs_PADTMP:

     src:CUR  SVs_TEMP    SVs_PADTMP
         ---  --------    ----------
           1     STEAL    DONT-STEAL
           2     STEAL    DONT-STEAL
           3     STEAL    DONT-STEAL
          ..
          78     STEAL    DONT-STEAL
          79     STEAL    DONT-STEAL
          80     STEAL    DONT-STEAL
          81     STEAL    DONT-STEAL
          82     STEAL    DONT-STEAL
          83     STEAL    DONT-STEAL
         ...
        1249     STEAL    DONT-STEAL
        1250     STEAL    STEAL [if src:LEN < 1330   {1562} ]
        1251     STEAL    STEAL [if src:LEN < 1331   {1563} ]
        1252     STEAL    STEAL [if src:LEN < 1333   {1565} ]
        ....
     1000000     STEAL    STEAL [if src:LEN < 100080 {1250000} ]
     1000001     STEAL    STEAL [if src:LEN < 100081 {1250001} ]


If not swipeable, then test for COWability. There are two sets of
criteria, based on whether the src SV is already marked as COW or not
(there is a greater cost in upgrading it to COW)

If the src is already COW:

    COW if:
       src:LEN == 0 (shared hash key)
    || (src:CUR >= 1250 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
    || dst:LEN < src:CUR+1

     src:CUR
         ---
           1     COW if wont fit in dst buf
           2     COW if wont fit in dst buf
           3     COW if wont fit in dst buf
          ..
          78     COW if wont fit in dst buf
          79     COW if wont fit in dst buf
          80     COW if wont fit in dst buf
          81     COW if wont fit in dst buf
          82     COW if wont fit in dst buf
          83     COW if wont fit in dst buf
         ...
        1249     COW if wont fit in dst buf
        1250     COW always [ if src:LEN < 1330    {1562}    ]
        1251     COW always [ if src:LEN < 1331    {1563}    ]
        1252     COW always [ if src:LEN < 1332    {1565}    ]
        ....                                               
     1000000     COW always [ if src:LEN < 1000080 {1250000} ]
     1000001     COW always [ if src:LEN < 1000081 {1250001} ]

If the src is not already COW:

    COW if:
       src SV is capable of being converted to COW
          (inc 1 spare byte available for refcnt)
    && (src:CUR >= 0 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
    && (   (src:CUR >= 1250 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
        || dst:LEN < src:CUR+1
       )

     src:CUR
         ---
           1     COW if wont fit in dst buf [ and if src:LEN <   2  {24}    ]
           2     COW if wont fit in dst buf [ and if src:LEN <   4  {24}    ]
           3     COW if wont fit in dst buf [ and if src:LEN <   6  {24}    ]
          ..
          78     COW if wont fit in dst buf [ and if src:LEN < 156  {97}    ]
          79     COW if wont fit in dst buf [ and if src:LEN < 158  {98}    ]
          80     COW if wont fit in dst buf [ and if src:LEN < 160  {100}   ]
          81     COW if wont fit in dst buf [ and if src:LEN < 161  {101}   ]
          82     COW if wont fit in dst buf [ and if src:LEN < 162  {102}   ]
          83     COW if wont fit in dst buf [ and if src:LEN < 163  {103}   ]
         ...
        1249     COW if wont fit in dst buf [ and if src:LEN < 1329 {1561}  ]
        1250     COW always                 [     if src:LEN < 1330 {1562}  ]
        1251     COW always                 [     if src:LEN < 1331 {1563}  ]
        1252     COW always                 [     if src:LEN < 1332 {1565}  ]
        ....
     1000000     COW always                 [     if src:LEN < 1000080 {1250000} ]
     1000001     COW always                 [     if src:LEN < 1000081 {1250001} ]


As can be seen, for large strings the "SvLEN < SvCUR + 80" test dominates,
while for small strings, the existing "< 1250" test and/or the new "SvLEN
< SvCUR * 2" test dominate.

I think this is basically the wrong way round; I'm not sure if this was
intentional or a coding error (the && should have been a || maybe??). I
think for large strings, the test should be "SvLEN < SvCUR * N" for some
small constant N like 1.25, (but with N big enough such that
malloc_size() rounding up won't cause issues), while for small strings,
the "SvLEN < SvCUR + M" test should dominate, but I think M should be
changed to something smaller than 80. Ideally, we should round up SvLEN
using the A+nB values found from startup probing, and use that as the
limit.

So together, I propose that the criteria should be something like:
    skip COW if:
           SvLEN >= SvCUR * 1.25
        && round_up(SvLEN) >= SvCUR

With Linux/glibc/64-bit where A=24 and B=16, this would yield thresholds
that are marked with {NNN} in the tables above.


So in conclusion, I am proposing for discussion right now,
a) re-enabling malloc_size() where available, and
   probing at runtime startup to determine constants A and B where mallocs
   will be of the form A+nB;
b) adjusting the SvCUR verses SvLEN algorithm as suggested above.
   I'm not totally wedded to my exact suggested criteria 2 paragraphs
   above; I'm open to suggestions.


-- 
This is a great day for France!
    -- Nixon at Charles De Gaulle's funeral

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About