From: Jarkko Hietaniemi
Date: January 17, 2012 00:40
Subject: Re: blead now has inversion lists for Unicode properties; Any Unicode performance benchmarks?
Message ID: CAJueppv4FJPwv8bPyUaPweW46Wya45TOE=CUfTgc4thw_t2t_w@mail.gmail.com

Great!  Someone listened to me... that feels unusual.

On Fri, Jan 13, 2012 at 8:09 PM, Karl Williamson
<public@khwilliamson.com> wrote:
> I have pushed to blead a series of commits that now use inversion lists to
> store Unicode binary properties, as Jarkko has long advocated.
>
> There are a number of changes and a number of implications, all
> hopefully invisible to the end user except for potential performance
> improvements.
>
> Inversion lists allow a binary search instead of a linear one.  As one
> of the commits details, a program that took 97 seconds on my box now
> takes 1.5.  However, that program was carefully constructed to be the
> worst case.  Programs with significant amounts of ASCII or Latin-1 data
> took much less time even under the old scheme, because those code
> points sit at the beginning of the lists the linear search walks.
>
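An inversion list, for those following along at home, is just a sorted
array of the code points at which membership in the set flips, so a
lookup is a plain binary search.  A minimal sketch in Perl (the core
does this in C):

    # Sorted starts of alternating in/out ranges; even indices open
    # ranges that are IN the set.  This list encodes [0-9A-Za-z].
    my @inv = (0x30, 0x3A, 0x41, 0x5B, 0x61, 0x7B);

    sub in_set {
        my ($inv, $cp) = @_;
        my ($lo, $hi) = (0, scalar @$inv);
        while ($lo < $hi) {              # count entries <= $cp
            my $mid = int(($lo + $hi) / 2);
            if ($inv->[$mid] <= $cp) { $lo = $mid + 1 }
            else                     { $hi = $mid }
        }
        return $lo % 2;                  # odd count => inside a range
    }

    print in_set(\@inv, ord "x") ? "in\n" : "out\n";   # prints "in"

Membership costs O(log n) comparisons no matter where in its range the
code point falls, which is what collapses the worst case from 97
seconds to 1.5.
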
> These commits move the calculation of what goes into a Unicode property
> to regex compile time instead of on-the-fly at execution time.  This
> can slow down compilation while speeding up execution.  (User-defined
> properties may not be known at compile time; those are still determined
> at execution time.)  One execution speed-up is that the regex optimizer
> can now analyze these properties and quickly reject matches that it
> previously had to assume might succeed.
>
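For anyone who hasn't used them, here is a toy user-defined property of
the kind that stays execution-time resolved.  The name InSmallVowel is
made up; the interface (an In*/Is* sub returning hex ranges, one per
line) is the one perlunicode documents:

    # A user-defined property matching the ASCII lowercase vowels.
    sub InSmallVowel {
        return join "",
            map { sprintf "%04X\t%04X\n", $_, $_ }   # START\tEND lines
            map { ord } qw(a e i o u);
    }

    print "o" =~ /\p{InSmallVowel}/ ? "match\n" : "no match\n";

Perl cannot know the contents of such a property until the sub is
actually called, which is why these cannot be folded into the
compile-time calculation.
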
> One way the old scheme coped with the linear search's performance (and
> the current scheme still does this) is by permanently storing the
> result of each search in a hash, so that the search is never performed
> again.  At the same time, it stored in the same hash value, one bit
> each, the results for the 63 adjacent code points, so that a request
> for any of them would also skip the linear search.  For example, if you
> tested whether code point 130 matches a property, it would at the same
> time work out whether each code point from 128 through 191 matches.
> This is not a true cache, as there is no provision for limiting the
> hash's size.  If, in the course of your program's execution, you match
> code points spread across many different 64-code-point blocks, the hash
> keeps growing, and at some point you run out of memory.  That is still
> the case.
>
> What has changed in these commits is that there is now only one hash
> per property.  There used to be a hash for every instance of every
> property in every regular expression.  Now all occurrences of \p{Any},
> for example, within a thread share the same hash, so running out of
> memory is delayed.  It also means that if you look up a given code
> point through any one of them, the answer becomes known to all.
> Previously, if you looked up "A" in one \p{Alpha} and then in a second
> instance of \p{Alpha}, later in the same regex or in a different one,
> the linear search ran each time.  Now the result is immediately visible
> to all instances, because they share the hash.
>
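Roughly, in Perl rather than the core's C (if I remember right, the
core calls these structures swashes), with slow_match() standing in for
the expensive search, and assuming a 64-bit perl for the bit twiddling:

    my %swash;    # one per property now; grows without bound

    sub slow_match { my ($cp) = @_; $cp >= 0x41 && $cp <= 0x5A }

    sub cached_match {
        my ($cp) = @_;
        my $block = $cp >> 6;            # 64 code points per entry
        if (!exists $swash{$block}) {
            my $bits = 0;
            for my $i (0 .. 63) {        # fill the whole block at once
                $bits |= 1 << $i if slow_match(($block << 6) | $i);
            }
            $swash{$block} = $bits;      # stored forever, never evicted
        }
        return ($swash{$block} >> ($cp & 63)) & 1;
    }

The "never evicted" line is the whole memory worry: every new block
touched is another permanent hash entry.
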
> I am troubled by the potential for running out of memory, though
> apparently there are no field reports of it happening.  One solution is
> to turn the hashes into true LRU caches with a maximum size.  I don't
> know how to do that without quite a bit of overhead.
>
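A minimal sketch of what such a bounded cache could look like; the cap
and names are hypothetical, and the eviction scan below is itself a
good illustration of the overhead problem:

    my (%cache, %last_use, $tick);
    my $MAX = 1024;                      # hypothetical size cap

    sub lru_get {
        my ($key, $compute) = @_;
        $last_use{$key} = ++$tick;       # mark as most recently used
        return $cache{$key} //= do {
            if (keys(%cache) >= $MAX) {  # full: drop the oldest entry
                my ($oldest) = sort { $last_use{$a} <=> $last_use{$b} }
                               keys %cache;
                delete $cache{$oldest};
                delete $last_use{$oldest};
            }
            $compute->($key);
        };
    }

The sort makes every eviction cost O(n log n); a hash plus a
doubly-linked list would get that to O(1), but then every cache hit
pays for list maintenance, which is exactly the overhead worry above.
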
> An easier solution is to dispense with the hashes altogether, now that
> we have much faster searching.  The result of the most recent search
> can easily be cached to avoid repeating a lookup immediately.  I've
> also read that the trade-off between hashes and binary searches is not
> necessarily in favor of hashes.  It would be nice to see how this works
> on real Unicode data.  Our regression test suite is not suitable for
> measuring this kind of performance.  Does anyone have ideas for good
> Unicode benchmarks?
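
The single-entry version really is about as cheap as a cache gets; a
sketch, with binary_search() standing in for the real inversion-list
lookup:

    sub binary_search { my ($cp) = @_; $cp < 0x80 }   # stand-in

    my ($last_cp, $last_res);

    sub match_cp {
        my ($cp) = @_;
        return $last_res if defined $last_cp && $cp == $last_cp;
        ($last_cp, $last_res) = ($cp, binary_search($cp));
        return $last_res;
    }

It catches runs of identical code points, which are common in real
text, and otherwise costs one extra comparison on top of the binary
search.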



-- 
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen
