On 01/18/2012 05:06 AM, Konovalov, Vadim (Vadim)** CTR ** wrote:
>> From: Karl Williamson
>>
>> One way the old scheme (and still the current one) coped with the
>> linear search performance is by permanently storing (in a hash) the
>> results of the search, so that it is never performed again. At the
>> same time, it stored in the same hash value the results of searching
>> for the 63 adjacent code points (a bit each), so that a request for
>> any of them would not go through the linear search. For example, if
>> you searched to see if code point 130 matches a property, it would at
>> the same time figure out whether or not all code points from 128
>> through 128+63 match the property. This is not a cache, as there is
>> no provision for limiting the size of the hash. If in the course of
>> your program execution you match code points 0, 63, 127, ..., the
>> hash will keep growing, and you will run out of memory at some
>> point. This is still the case.
>
> Why could memory exhaustion ever happen here?
>
> Is the given hash per regular expression?

Until these patches there was a hash for every single Unicode property
instance in every regular expression. Now there is one hash per Unicode
property, no matter which regular expressions it appears in. In other
words, if you have /\p{Any}/, and later in the program
/\p{Any}abc\p{Word}/, there is now one hash for Any and another for
Word. Before these changes there would have been 3 hashes, including
one for each 'Any' instance.

> As far as I understood it, this is a hash from code point to
> something else, hence in the worst case it will never grow bigger
> than the number of all code points (even for huge strings to match).
>
> So, as I see your explanation, it will stop growing at some point.

This is a hash of code points, and yes, it does stop growing. Each
entry covers 2**6 code points. Theoretically the worst case over the
Unicode-defined code points is 2**15 entries, which is 32K entries.
That is per property. But if above-Unicode code points are searched
for, the hash will grow to accommodate them. So for a hash on a 32-bit
machine, the worst case is 2**26 (about 67 million) entries, and on a
64-bit machine, 2**58. So, I'm sorry, I did exaggerate the problem
somewhat. It is theoretically possible to run out of memory, but
unlikely in real-world applications.

> Otherwise, can you please point me directly to any usage of this hash
> in the Perl source, so I will have an idea of how it is constructed?

The hash is built by swash_get() in utf8.c.

> Vadim.