develooper Front page | perl.perl5.porters | Postings from August 2012

Seeking guidance on Benchmarking and potentially removing swash caching

Thread Next
Karl Williamson
August 19, 2012 11:15
Seeking guidance on Benchmarking and potentially removing swash caching
Message ID:
The data structure used to store and access Unicode properties and 
[bracketed character classes] in Perl regular expressions is a swash.
A swash is a hash with a few keys.  The value of one of those keys is a 
list of code points that match the swash, stored as an inversion list. 
The value of another of the keys is a sub-hash which contains a cache of 
the results for every code point that the program has so-far checked 
against the list.  I am considering changing to omit this sub-hash 
cache, so every time the result is needed, it would be re-looked up 
(using the same binary search of the inversion list that is done now for 
the first look-up).  I am seeking guidance for suitable benchmarks and 
the advisability of making this change.  I was inspired to do this now 
because it make other changes I'm in the process of making easier to do, 
and from the discussion at
which indicates that a binary search can be faster than a hash.

In this case the binary search is faster than the blog posting would 
indicate, as it is done on a C array in a C loop.

A problem with the hash cache is that it can conceivably grow very large 
with time.  Potentially, every code point representable on the machine 
could have a bit allocated for it.  This means that the upper limit of 
memory a swash could use is 2**64 bits plus keys and overhead on a 
64-bit platform; 2**32 on a 32-bit one.  More realistically, if we limit 
ourselves to code points used by Unicode, the number is 2**21; but only 
about 25% of Unicode has been allocated, which brings the likely maximum 
number down to around 2**19, which is .5Mb = 64KB.  In practice, the 
number is much smaller, as bits only get allocated for code points that 
are actually looked up.  But, if you have ever done, as I have, a loop 
in which you start at 0 and test every code point, you will eat up more 
and more memory.  This all could be solved by limiting the size of the 
cache with an LRU algorithm, but at the expense of additional overhead.

I decided to do some experimentation.  I ran a version of the Perl core 
that skips the caching against the regression test suite (make test), 
and compared the times to the standard one.  The differences were in the 

I then Benchmarked t/re/uniprops.t, which exercises the Unicode 
properties, and thus doesn't have a lot of unrelated tests that would 
mask the differences in the approaches.  The results indicate that the 
binary search alone is very very slightly faster on this set of tests.

Binary search alone: timethis 100: 1340 wallclock secs (1336.69 usr + 
1.44 sys = 1338.13 CPU) @  0.07/s (n=100)

sub-hash cache: timethis 100: 1348 wallclock secs (1345.35 usr +  1.31 
sys = 1346.66 CPU) @  0.07/s (n=100)

It is quite easy to change the binary search-only version to cache the 
last search result.  This speeds up this benchmark a little: timethis 
100: 1332 wallclock secs (1328.83 usr +  1.16 sys = 1329.99 CPU) @ 
0.08/s (n=100)

So the modified binary search looks very slightly faster (1% over the 
hash cache) on this data, and doesn't suffer from creeping memory use. 
I suspect also that the hash cache gets slower over time as more code 
points get added to it, increasing collisions.

I would prefer to benchmark more realistic real-world data, and wonder 
if anyone has some ideas on such, or any other thoughts on this issue.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About