From: Karl Williamson
Date: January 13, 2012 11:09
Subject: blead now has inversion lists for Unicode properties; any Unicode performance benchmarks?
Message ID: 4F10816A.8070306@khwilliamson.com

I have pushed to blead a series of commits that now use inversion lists 
to store Unicode binary properties, as Jarkko has long advocated.

There are a number of changes and a number of implications, all 
hopefully invisible to the end user except for potential performance 
improvements.

Inversion lists allow a binary search instead of a linear one.  As one 
of the commits details, a program that took 97 seconds on my box now 
takes 1.5.  However, that program was carefully constructed to be a 
worst case.  Programs that have significant amounts of ASCII or Latin1 
data take much less time under the old scheme, because those code 
points are at the beginning of the lists being searched linearly.
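
For anyone unfamiliar with the data structure, here is a minimal sketch 
of the idea in Perl rather than the C internals (the helper and list 
names are invented for illustration).  An inversion list is just a 
sorted list of the code points at which membership flips, so a binary 
search that lands on an even index means "in the set":

    # Test whether $cp is in the set described by the inversion list.
    sub invlist_contains {
        my ($invlist, $cp) = @_;    # $invlist: array ref, ascending start points
        return 0 if !@$invlist || $cp < $invlist->[0];
        my ($lo, $hi) = (0, $#$invlist);
        while ($lo < $hi) {         # largest index whose start is <= $cp
            my $mid = int(($lo + $hi + 1) / 2);
            if ($invlist->[$mid] <= $cp) { $lo = $mid } else { $hi = $mid - 1 }
        }
        return ($lo % 2) == 0;      # even index => inside one of the ranges
    }

    # [A-Za-z] as an inversion list: ranges [0x41,0x5B) and [0x61,0x7B)
    my @ascii_alpha = (0x41, 0x5B, 0x61, 0x7B);
    print invlist_contains(\@ascii_alpha, ord 'q') ? "match\n" : "no match\n";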

These commits move the calculation of what goes into a Unicode property 
to regex compile time instead of doing it on the fly at execution time. 
This can slow down compilation while speeding up execution. 
(User-defined properties may not be known at compile time; these are 
still determined at execution time.)  One execution-time speed-up is 
that the regex optimizer can now analyze these properties and reject 
matches that it previously had to assume might succeed.
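
For completeness, this is the sort of user-defined property that cannot 
be resolved at compile time; the sub name and code points here are made 
up.  Perl calls the sub only when a match first needs the property:

    # One code point (or "start\tend" range) per line, in hexadecimal.
    sub InVowel {
        return join "", map { "$_\n" } qw(0041 0045 0049 004F 0055);
    }

    print "matched a vowel\n" if "FOO" =~ /\p{InVowel}/;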

One way the old scheme (and still the current one) coped with the 
linear search's performance was to permanently store the results of a 
search in a hash, so that the search is never performed again.  At the 
same time, it stored in the same hash value the results of searching 
for the 63 adjacent code points (one bit each), so that a request for 
any of them would not go through the linear search.  For example, if 
you searched to see whether code point 130 matches a property, it would 
at the same time figure out whether each code point from 128 through 
128+63 matches the property.

This is not a cache, as there is no provision for limiting the size of 
the hash.  If in the course of your program's execution you match code 
points 0, 64, 128, ... (one per 64-code-point block), the hash will 
keep growing, and you will run out of memory at some point.  This is 
still the case.

What has changed in these commits is that there is now only one hash 
per property.  There used to be a hash for every instance of every 
property in every regular expression.  Now all occurrences of \p{Any}, 
for example, in a thread will share the same hash, and running out of 
memory will be delayed.  This also means that if you look up a given 
code point via any one of them, the answer becomes known to all. 
Previously, if you looked up "A" in one \p{Alpha}, and again in a 
second instance of \p{Alpha} later in the same regex, or in a different 
regex, it would have to go through the linear search each time.  Now 
the results are known immediately to all instances, because they share 
the hash.
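
To make that concrete, here is a rough Perl-level sketch of the storage 
just described, with one bit per code point in a 64-code-point block. 
The names are invented, the real code lives in the C internals, and the 
bit twiddling assumes a 64-bit perl:

    my %prop_cache;    # $prop_cache{$property}{$block} = 64-bit bitmap

    sub cached_match {
        my ($property, $lookup, $cp) = @_;   # $lookup does the real (slow) search
        my $block = $cp >> 6;
        my $bits  = $prop_cache{$property}{$block};
        if (!defined $bits) {
            $bits = 0;
            for my $i (0 .. 63) {            # compute the 63 neighbors too
                $bits |= 1 << $i if $lookup->(($block << 6) + $i);
            }
            # stored permanently -- nothing ever evicts it, hence not a cache
            $prop_cache{$property}{$block} = $bits;
        }
        return ($bits >> ($cp & 63)) & 1;
    }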

I am troubled by the potential for running out of memory, though 
apparently there are no field reports of it happening.  One solution is 
to turn the hashes into true LRU caches with a maximum size.  I don't 
know how to do that without quite a bit of overhead.
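
For what it's worth, here is the minimal shape such an LRU might take, 
sketched at the Perl level only; even this toy version shows where the 
per-lookup bookkeeping overhead comes from (the linear reshuffle of the 
recency list on every hit):

    package TinyLRU;

    sub new {
        my ($class, $max) = @_;
        return bless { max => $max, h => {}, order => [] }, $class;
    }

    sub get {
        my ($self, $k) = @_;
        return undef unless exists $self->{h}{$k};
        # move the key to the most-recently-used end -- the overhead in question
        @{ $self->{order} } = ( (grep { $_ ne $k } @{ $self->{order} }), $k );
        return $self->{h}{$k};
    }

    sub set {
        my ($self, $k, $v) = @_;
        @{ $self->{order} } = grep { $_ ne $k } @{ $self->{order} };
        push @{ $self->{order} }, $k;
        $self->{h}{$k} = $v;
        delete $self->{h}{ shift @{ $self->{order} } }
            if @{ $self->{order} } > $self->{max};
    }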

An easier solution is to dispense with the hashes altogether now that 
we have much faster searching capabilities.  The result of the most 
recent search can easily be cached to avoid an immediate repeat lookup. 
I've also read that the trade-off between hashes and binary searches is 
not necessarily in favor of hashes.  It would be nice to see how this 
works on real Unicode data.  Our regression test suite is not suitable 
for measuring this kind of performance.  Does anyone have ideas for 
good Unicode benchmarks?
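
In the absence of something better, the kind of crude comparison I have 
in mind would be run on both a pre-change and a post-change perl; the 
text below is synthetic, and Benchmark is a core module:

    use strict; use warnings;
    use Benchmark qw(cmpthese);

    my $ascii = join '', map { chr(0x41   + int rand 26)     } 1 .. 10_000;
    my $greek = join '', map { chr(0x391  + int rand 24)     } 1 .. 10_000;
    my $han   = join '', map { chr(0x4E00 + int rand 0x5000) } 1 .. 10_000;

    cmpthese(-3, {
        ascii => sub { my $n = () = $ascii =~ /\p{Alpha}/g },  # count matches
        greek => sub { my $n = () = $greek =~ /\p{Alpha}/g },
        han   => sub { my $n = () = $han   =~ /\p{Alpha}/g },
    });

Real data would obviously be better than random code points, but 
something like this at least exercises the property lookups outside the 
ASCII range.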
