I have pushed to blead a series of commits that now use inversion lists to store Unicode binary properties, as Jarkko has long advocated. There are a number of changes and a number of implications, all hopefully invisible to the end user except for potential performance improvements.

Inversion lists allow for a binary search instead of a linear one (a sketch of such a lookup appears at the end of this message). As one of the commits details, a program that took 97 seconds on my box now takes 1.5. However, that test was carefully constructed to be the worst case. Programs with significant amounts of ASCII or Latin-1 data take much less time under the old scheme, because those code points come at the beginning of the linear searches.

These commits move the calculation of what goes into a Unicode property to regex compile time instead of on the fly at execution time. This can slow down compilation while speeding up execution. (User-defined properties may not be known at compile time; these are still determined at execution time.) One execution speed-up is that the regex optimizer can now analyze these properties and reject target strings that it previously had to assume could match.

One way the old scheme coped with the linear-search performance (and the current scheme still does this) is by permanently storing the result of each search in a hash, so that the search is never performed again. At the same time, it stored in the same hash value the results for the 63 adjacent code points (one bit each), so that a request for any of them would not go through the linear search either. For example, if you asked whether code point 130 matches a property, it would at the same time figure out whether or not every code point from 128 through 128+63 matches it (this trick is also sketched below).

This is not a cache, as there is no provision for limiting the size of the hash. If, in the course of your program's execution, you match code points 0, 64, 128, ..., the hash will keep growing, and you will run out of memory at some point. This is still the case. What has changed in these commits is that there is now only one hash per property. There used to be a hash for every instance of every property in every regular expression. Now all occurrences of \p{Any}, for example, in a thread will share the same hash, and running out of memory will be delayed. This also means that if you search for a given code point through any one of them, the answer becomes known to all. Previously, if you looked up "A" in one \p{Alpha} and again in a second instance of \p{Alpha} later in the same regex, or in a different regex, it would go through the linear search each time. Now the result is immediately known to all instances, because they share the hash.

I am troubled by the potential for running out of memory, though apparently there are no field reports of it. One solution is to turn the hashes into true LRU caches with a maximum size; I don't know how to do that without quite a bit of overhead. An easier solution is to dispense with the hashes altogether, now that we have much faster searching. The result of the most recent search can easily be cached to avoid looking the same code point up again immediately (see the final sketch below).

I've also read that the trade-off between hashes and binary searches is not necessarily in favor of hashes. It would be nice to see how this performs on real Unicode data. Our regression test suite is not suitable for measuring this kind of performance. I wonder if someone has ideas on good Unicode benchmarks?
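
To make the first point concrete, here is a minimal sketch in C of how a membership test on an inversion list can work. It illustrates the data structure only and is not the core's actual code; the names are mine, and UV here is a stand-in typedef for Perl's unsigned integer type.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t UV;    /* stand-in for Perl's UV */

    /*
     * An inversion list is a sorted array of code points.  An element at
     * an even index starts a range that is IN the property; an element
     * at an odd index starts a range that is OUT of it.  For example,
     * { 0x41, 0x5B, 0x61, 0x7B } represents [A-Za-z].
     */
    static bool
    invlist_contains(const UV *list, size_t len, UV cp)
    {
        size_t lo = 0, hi = len;        /* search window is [lo, hi) */

        if (len == 0 || cp < list[0])
            return false;               /* below the first range */

        /* Binary-search for the largest index i with list[i] <= cp. */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }

        /* An even index means cp falls inside an "in" range. */
        return (lo % 2) == 0;
    }

With the [A-Za-z] list above, looking up 0x71 ('q') lands on index 2 (the start of the a-z range) and matches; 0x5B lands on index 1 and does not. The cost is O(log n) wherever the code point falls, which is why ASCII data no longer has the privileged position it had under a front-to-back linear scan.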
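
The 64-code-point trick can be sketched as well. This is my guess at the general shape, not the actual swatch code in the core: the hash key would be cp >> 6 and the value a 64-bit mask; here a single cached block stands in for one hash entry, so only the first lookup in a block pays for the expensive searches. (This continues in the same file as the previous sketch.)

    /*
     * Hypothetical sketch: one 64-bit value records an entire aligned
     * block of 64 code points for some property.
     */
    typedef struct {
        uint32_t block;  /* cp >> 6: which block this entry covers        */
        uint64_t bits;   /* bit i set iff code point block*64 + i matches */
        bool     valid;
    } Swatch;

    /* Fill the swatch by testing all 64 code points in cp's block once. */
    static void
    fill_swatch(Swatch *sw, uint32_t cp, bool (*matches)(uint32_t))
    {
        uint32_t base = cp & ~63u;       /* round down to block start */

        sw->block = cp >> 6;
        sw->bits  = 0;
        sw->valid = true;
        for (unsigned i = 0; i < 64; i++)
            if (matches(base + i))       /* the expensive search */
                sw->bits |= (uint64_t)1 << i;
    }

    /* Every later lookup in the same block is a single bit test. */
    static bool
    swatch_lookup(Swatch *sw, uint32_t cp, bool (*matches)(uint32_t))
    {
        if (!sw->valid || sw->block != cp >> 6)
            fill_swatch(sw, cp, matches);
        return (sw->bits >> (cp & 63)) & 1;
    }

A lookup of code point 130 fills the block covering 128 through 191, which is the behavior described above.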
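
Finally, the last-result cache mentioned near the end is small enough to show too. Again hypothetical, reusing invlist_contains() from the first sketch; it keeps the memo with the property itself, so that lookups against different properties cannot clobber one another's cached answer.

    /* A property with a one-entry memo of its most recent lookup. */
    typedef struct {
        const UV *list;      /* the property's inversion list */
        size_t    len;
        UV        last_cp;   /* most recently queried code point */
        bool      last_ans;  /* ...and whether it matched */
        bool      have_last;
    } Property;

    static bool
    prop_contains(Property *p, UV cp)
    {
        if (!p->have_last || cp != p->last_cp) {
            p->last_cp   = cp;
            p->last_ans  = invlist_contains(p->list, p->len, cp);
            p->have_last = true;
        }
        return p->last_ans;
    }

This drops the unbounded hash entirely: memory use is constant, and the common pattern of testing the same code point against the same property several times in a row still skips the binary search.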