perl.perl5.porters | Postings from August 2010

Re: Patch to make string-append on win32 100 times faster

From:
demerphq
Date:
August 20, 2010 11:59
Subject:
Re: Patch to make string-append on win32 100 times faster
Message ID:
AANLkTi=QGxF+aOpk9pnrFXiHHPdXuY87cAsij4FAVCVF@mail.gmail.com
On 20 August 2010 19:43, Jan Dubois <jand@activestate.com> wrote:
> On Fri, 20 Aug 2010, demerphq wrote:
>> On 20 August 2010 16:53, demerphq <demerphq@gmail.com> wrote:
>>> On 16 August 2010 08:19, Jan Dubois <jand@activestate.com> wrote:
>>>> Could you provide some evidence for this claim?  The only way a
>>>> "better malloc" can prevent this slowdown is by doing some kind of
>>>> overallocation itself.
>>>
>>> This is not correct. Mallocs/reallocs that can merge blocks do not
>>> have the performance penalty that this algorithm seeks to work
>>> around. The problem here is that the Win32 realloc always copies, and
>>> thus extending a block a character at a time becomes quadratic.
>>> A realloc that merges blocks, and copies only when there is
>>> insufficient contiguous space, does not have this problem.
>>
>> I'll just note that I'm not arguing against this patch, just that
>> overallocation is not the only reason a malloc might not be
>> penalized by this change.
>
> Indeed, "the only way" was a rather poor choice of words.
>
> Note though that all the block merge algorithms still depend heavily
> on the allocation pattern.  If there are no free blocks to merge, then
> they still provide poor performance, and that case *will* be helped
> by the overallocation during the growth phase.  This has been shown
> for both FreeBSD and the Linux allocators.

IMO this is "just" common sense: if we call realloc/malloc less often
then the underlying code will be faster.  Given that the Perl
tradition is to trade memory for speed in most situations, this patch
makes sense -- and it /also/ provides a vector for us to offer a

  use less 'memory';

which would control how much we preallocate.

>> One real-world benchmark that people might want to try would be to use
>> a routine like this:
>>
>> sub make_tree {
>>   my $depth = shift;
>>   return int rand 100 unless $depth > 0;
>>   return [ make_tree($depth-1), make_tree($depth-1) ];
>> }
>>
>> and then use the XS implementation of Data::Dumper to dump the results
>> of make_tree() for various N.
>>
>> On win32 even modest N will result in the machine essentially hanging.
>> On no other OS that I've tried it on is the slowdown as noticeable.
>> This was traced to the use of realloc() in sv_grow(), the analysis
>> that led to Nicholas' original patch.
>
> I don't see how sv_grow() comes into this example at all.  Are you talking
> about av_expand()?

Actually no, I'm talking about sv_grow(). Don't forget I said to pass
the output of make_tree() into the XS version of Data::Dumper, which
will stringify the resulting binary tree by extending a buffer
essentially character by character (actually token by token, but this
example is pretty close to char by char) to add open brackets,
integers, and close brackets. It uses sv_grow() to keep growing the
buffer as it does so. Interestingly, on Windows you can often make
Data::Dumper go faster by NOT using the XS version, forcing it to
use the pure perl version via $Data::Dumper::Useqq or more recent
equivalents. I'm not sure exactly why this makes a difference, I never
dug in, but Nicholas was able to demonstrate a huge speedup on some
platforms with his original hack along these lines.  Basically it
turned out that DD was calling realloc() pretty much once per
character emitted while dumping. By preallocating an extra 10% the
number of realloc calls was reduced dramatically.

Also, note that av_expand() is never called in my example, as all
arrays are preallocated with 8 elements (er, I think -- certainly more
than 2) and my code uses only 2 per array. The point here is that the
code will produce *many* arrays, and will require DD to do a lot of
concatenation to do the dump.

> That is indeed an area that should probably be scrutinized for potential
> Windows performance improvements too.  I also have some vague memory that
> hash expansion beyond something like 100,000 keys gets extremely slow on
> Windows.

Sounds interesting, but it's not the case I meant; I was strictly
referring to the way the XS version of Data::Dumper builds the
stringified version of the object being dumped.

Cheers,
yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


