develooper Front page | perl.perl5.porters | Postings from March 2010

Re: Another optimization question: bsearch()

Thread Previous | Thread Next
karl williamson
March 11, 2010 12:13
Re: Another optimization question: bsearch()
Message ID:
Nicholas Clark wrote:
> On Wed, Mar 03, 2010 at 10:39:14PM -0700, karl williamson wrote:
>> karl williamson wrote:
>>> Nicholas Clark wrote:
>>>> If they were instead built as const static C data structures, much 
>>>> like the
>>>> tables used by Encode, then they could be linked into perl at build time.
>>>> This would mean
>>>> 1: Zero load time
>>>> 2: (likely) faster runtime lookup
>>>> 3: implicitly shared between perl processes (on most OSes)
>>>> Nicholas Clark
>>> I thought the reason for not doing this was memory size.
>>> The book "Unicode Demystified" says that most Unicode properties can be 
>>> stored in 32 bits per code point (by using various tricks including 
>>> exception lists for out-of-the-ordinary characters).  But that was for 
>>> Unicode 3.2, and there have been quite a few more properties added since 
>>> then.  I think all of those are binary, so I guess that all things 
>>> considered 96 bits per code point is an upper bound.  So the range is 
>>> more than 4 bytes per code point, say 6, to 12 bytes per code point. 
>>> Unicode has 1,114,112 code points.  A straight forward implementation 
>>> would require 7-14 MB.  But currently, a swath in the middle 11/16 of 
>>> the space is unassigned, and the last 1/8 of the space is private use, 
>>> so all the code points in there have the same properties as far as perl 
>>> is concerned, and the 1/16 just before this private use space has only 
>>> 500 assigned points out of 65536.  So, by playing a few games, the 
>>> memory could be cut by 14/16, or down to about 1-2 MB currently, using 
>>> more space as more characters are assigned by Unicode.  The rumor is 
>>> that Unicode 6.0 coming out this year will start to eat into that large 
>>> swath.
>> I did a little more investigation, and found that the current need is 
>> more like 200 bits/code point, excluding Name, Normalization, Block, and 
>> a couple things that Perl doesn't currently do anything with.  And, 
>> another 1/16 of the space can essentially be represented by one entry. 
>> That comes out to about 3.5MB with a reasonably straight forward 
>> implementation.
> OK cool.
> It's nice to know that bitmasks would work. I was actually assuming that it
> would be possible to build inversion lists as C structures at perl compile
> time, with a bsearch on them, to replace our current use of HVs to store
> things. (I hope I'm not too wrong on that last part)

I was originally talking about inversion lists instead of the current 
swashes, and they could be built as C arrays at Perl compile time.  I 
don't comprehend very well the existing mechanism, but my guess is it's 
overly complicated.  Hashes are used for caching, and for things that 
aren't simple mappings, such as multi-char case changes.  I don't know 
about regular stuff, except that when I watch an execution trace, trying 
to look things up, it looks like a lot of linear searching.
> I wonder if there's a sweet spot, with some things stored in inversion lists,
> and other things store with 1 (or more) bits per code point, possibly as
> a set of disjoint tables to exploit the large gap.

The problem with doing both is we have to implement two things instead 
of one thing.

I'm inclined to go with the arrays as long as the space is ok, as the 
speed-up appears to me to be much greater.

I haven't thought about it too much, but I did think some to come up 
with my estimate, and that estimate includes not having too many huge gaps.

For simplicity, suppose we were implementing this on a computer with a 
200 bit word size, and that is the number of bits we need.

All code points above the Unicode maximum would map to one such word.
All unassigned code points would map to another (except for the 
non-character ones, which are determinable by their ordinals alone, so 
that property wouldn't be one of the 200 bits, so the statement remains 

The Unicode space is partitioned into 17 planes of 65536 =(2**16) code 
points each.  Each plane is semi-independent.  Instead of having one 
large table with all the planes covered, one can have a table per plane, 
and for most of the planes, its table length can be one, as all code 
points in the plane have identical properties.  Further, instead of 
having a single 200-bit word, there could be several words, each 
containing related properties.  In all but plane 0 a number of those 
words would be the same for every code point in it, as for example 
Chinese doesn't have folding or casing.  My estimate didn't account for 
savings from this last step.

The bottom line is that by a slight amount of overhead, the table size 
comes down tremendously.  Other games can be played as well.
talks about some of them.

> However, even 7-14Mb doesn't scare me. On any demand paged OS, if it's not
> used, it's not paged in. I doubt that we have any platform where we're that
> tight on virtual memory space for executables.
> (and this is where someone proves me wrong. An initial guess - Symbian)
> Nicholas Clark

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About