perl.perl5.porters | Postings from January 2012
Re: blead now has inversion lists for Unicode properties; ? Any Unicode performance benchmarks ?
From:
Jarkko Hietaniemi
Date:
January 17, 2012 00:40
Subject:
Re: blead now has inversion lists for Unicode properties; ? Any Unicode performance benchmarks ?
Message ID:
CAJueppv4FJPwv8bPyUaPweW46Wya45TOE=CUfTgc4thw_t2t_w@mail.gmail.com
Great! Someone listened to me... that feels unusual.
On Fri, Jan 13, 2012 at 8:09 PM, Karl Williamson
<public@khwilliamson.com> wrote:
> I have pushed to blead a series of commits that now use inversion lists to
> store Unicode binary properties, as Jarkko has long advocated.
>
> There are a number of changes and a number of implications, all hopefully
> invisible to the end-user except potential performance improvements.
>
> Inversion lists allow a binary search instead of a linear one. As one of the
> commits details, a program that took 97 seconds on my box now takes 1.5.
> However, that program was carefully constructed to be a worst case. Programs
> that have significant amounts of ASCII or Latin1 data take much less time in
> the old scheme, because those code points come at the beginning of its
> linear searches.
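[A minimal sketch of the inversion-list idea, in Python for illustration; the list and names here are mine, not perl's. An inversion list stores only the boundaries where a property flips on or off, so membership is one binary search: a code point is in the property iff an odd number of boundaries are at or below it.]

```python
from bisect import bisect_right

# Hypothetical inversion list: even-indexed entries start in-property
# ranges, odd-indexed entries start out-of-property ranges.  This one
# covers 0x41-0x5A and 0x61-0x7A (ASCII letters, for illustration only).
inv_list = [0x41, 0x5B, 0x61, 0x7B]

def in_property(cp, inv):
    """A code point is in the property iff the count of boundaries
    <= cp is odd -- one binary search, not a linear range scan."""
    return bisect_right(inv, cp) % 2 == 1

print(in_property(ord("A"), inv_list))   # True
print(in_property(ord("@"), inv_list))   # False
```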
>
> These commits move the calculation of what goes into a Unicode property to
> regex compile time, instead of doing it on the fly at execution time. This
> can slow down compilation while speeding up execution. (User-defined
> properties may not be known at compile time--those are still determined at
> execution time.) One execution speed-up is that the regex optimizer can now
> analyze these properties and reject matches that it previously had to
> assume might succeed.
>
> One way the old scheme coped (and the current one still copes) with the
> linear-search performance is by permanently storing the results of a search
> in a hash, so that the search is never performed again. At the same time it
> stored, in the same hash value, the results for the 63 adjacent code points
> (one bit each), so that a request for any of them would not have to go
> through the linear search either. For example, if you searched to see
> whether code point 130 matches a property, it would at the same time figure
> out whether each code point from 128 through 191 matches the property. This
> is not a true cache, as there is no provision for limiting the size of the
> hash. If in the course of your program's execution you match code points 0,
> 64, 128, ..., the hash will keep growing, and you will run out of memory at
> some point. This is still the case. What has changed in these commits is
> that there is now only one hash per property. There used to be a hash for
> every instance of every property in every regular expression. Now all
> occurrences of \p{Any}, for example, in a thread will share the same hash,
> and running out of memory will be delayed. This also means that if you
> search for a given code point through any one of them, the answer becomes
> known to all. Previously, if you looked up "A" in one \p{Alpha} and again
> in a second instance of \p{Alpha}, later in the same regex or in a
> different one, it had to go through the linear search each time. Now the
> result is immediately known to all instances, because they share the hash.
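[The block-memoization scheme described above can be sketched like this in Python; the names are mine and this is not perl's code. The first lookup of any code point resolves its whole 64-wide block with the slow search and stores the 64 answers as a bitmask keyed by cp >> 6; later lookups in that block cost one hash probe plus a bit test.]

```python
def make_cached_matcher(slow_match):
    cache = {}   # grows without bound -- not a true cache, as noted above
    def match(cp):
        block = cp >> 6
        if block not in cache:
            base = block << 6
            bits = 0
            for i in range(64):          # one slow pass covers base..base+63
                if slow_match(base + i):
                    bits |= 1 << i
            cache[block] = bits
        return bool((cache[block] >> (cp & 63)) & 1)
    return match

# Looking up code point 130 resolves 128..191 in one shot; 128 is then free.
calls = []
def slow(cp):
    calls.append(cp)
    return 0x80 <= cp <= 0x96
m = make_cached_matcher(slow)
m(130)
m(128)               # same block: no further slow_match calls
print(len(calls))    # 64
```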
>
> I am troubled by the potential for running out of memory, though apparently
> there are no field reports of it. One solution is to turn the hashes into a
> true LRU cache, which has a maximum size. I don't know how to do that
> without quite a bit of overhead.
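[For reference, a minimal LRU cache looks like the sketch below (Python's OrderedDict does the bookkeeping; a C implementation in the perl core would need its own doubly linked list, which is the overhead being worried about).]

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache with a hard size cap: evicts the least
    recently used entry when full.  Illustrative only."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = OrderedDict()
    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)        # mark as most recently used
        return self.data[key]
    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_size:
            self.data.popitem(last=False) # drop the least recently used
```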
>
> An easier solution is to dispense with the hashes altogether, now that we
> have much faster searching. The results of the last search can easily be
> cached to avoid repeating the same lookup immediately. I've also read that
> the trade-off between hashes and binary searches is not necessarily in
> favor of hashes. It would be nice to see how this works on real Unicode
> data. Our regression test suite is not suitable for measuring this kind of
> performance. I wonder if someone has ideas on good Unicode benchmarks?
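The "cache only the last search" idea might look like this sketch (Python, with a hypothetical inversion list; not perl's code): remember the range the last binary search landed in, so runs of nearby code points are answered without searching at all.

```python
from bisect import bisect_right

def make_matcher(inv):
    last = {"lo": 1, "hi": 0, "res": False}  # empty range: first call searches
    def match(cp):
        if last["lo"] <= cp < last["hi"]:
            return last["res"]               # hit: no search needed
        i = bisect_right(inv, cp)            # binary search over boundaries
        last["lo"] = inv[i - 1] if i else 0
        last["hi"] = inv[i] if i < len(inv) else 0x110000
        last["res"] = i % 2 == 1             # odd index => inside the property
        return last["res"]
    return match

alpha = make_matcher([0x41, 0x5B, 0x61, 0x7B])  # ASCII letters, illustration
print(alpha(ord("A")), alpha(ord("Z")))         # second call hits the cached range
```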
--
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen