develooper Front page | perl.perl5.porters | Postings from January 2012

Re: blead now has inversion lists for Unicode properties; ? Any Unicodeperformance benchmarks ?

Thread Previous | Thread Next
Karl Williamson
January 18, 2012 09:27
Re: blead now has inversion lists for Unicode properties; ? Any Unicodeperformance benchmarks ?
Message ID:
On 01/18/2012 05:06 AM, Konovalov, Vadim (Vadim)** CTR ** wrote:
>> From: Karl Williamson
>> One way the old scheme (and still in the current) coped with
>> the linear
>> search performance is by permanently storing (in a hash) the
>> results of
>> the search, so that it is never performed again.  It also at the same
>> time stored in the same hash value the results of searching
>> for the 63
>> adjacent code points (a bit each), so that a request for any of them
>> would not go through the linear search.  For example, if you
>> searched to
>> see if code point 130 matches a property, it would at the same time
>> figure out whether or not all code points from 128 through
>> 128+63 match
>> the property.  This is not a cache as there is no provision
>> for limiting
>> the size of the hash.  If in the course of your program
>> execution, you
>> match code points 0, 63, 127, ... the hash will keep growing, and you
>> will run out of memory at some point.  This is still the
>> case.
> why memory exhaustion could ever happen here?
> Is the given hash - per regular expression?

Until these patches the hash was per every single Unicode property 
instance in every regular expression.  Now there is a hash per Unicode 
property no matter what regular expression it appears in.  In other 
words, if you have /\p{Any}/, and later in the program
/\p{Any}abc\p{Word}/, there would be a hash for Any, and another hash 
for Word.  Before these changes there would be 3 hashes, including one 
for each 'Any' instance.
> As far as I understood it - this is a hash of codepoint by something else,
> hence it will never grow bigger than number of all codepoints in worst
> case (for even huge strings to match).

> So - as I see your explanation - it will stop growing at some point..

This is a hash of code points.  And yes, it does stop growing.  Each 
entry covers 2**6 code points.  Theoretically the worst case on the 
Unicode-defined code points is 2**15 entries, which is 32K entries. 
That is per property.  But if above-Unicode code points are searched 
for, the hash will grow to accommodate them.  So for a hash on a 32 bit 
machine, the worst case is 2**26 entries = 67Mb, and on a 64 bit 
machine, 2**58.  So, I'm sorry, I did exaggerate the problem somewhat. 
It is theoretically possible to run out of memory, but unlikely in real 
world applications.
> Otherwise - can you please just point me directly into Perl source
> any usage of this hash, so I will have an idea on how it is constructed?

The hash is built by swash_get() in utf8.c
> Vadim.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About