develooper Front page | perl.perl5.porters | Postings from March 2010

Re: Another optimization question: bsearch()

Thread Previous | Thread Next
From:
Nicholas Clark
Date:
March 8, 2010 09:05
Subject:
Re: Another optimization question: bsearch()
Message ID:
20100308170531.GL28880@plum.flirble.org
On Wed, Mar 03, 2010 at 10:39:14PM -0700, karl williamson wrote:
> karl williamson wrote:
> >Nicholas Clark wrote:

> >>If they were instead built as const static C data structures, much 
> >>like the
> >>tables used by Encode, then they could be linked into perl at build time.
> >>This would mean
> >>
> >>1: Zero load time
> >>2: (likely) faster runtime lookup
> >>3: implicitly shared between perl processes (on most OSes)
> >>
> >>Nicholas Clark
> >>
> >
> >I thought the reason for not doing this was memory size.
> >
> >The book "Unicode Demystified" says that most Unicode properties can be 
> >stored in 32 bits per code point (by using various tricks including 
> >exception lists for out-of-the-ordinary characters).  But that was for 
> >Unicode 3.2, and there have been quite a few more properties added since 
> >then.  I think all of those are binary, so I guess that all things 
> >considered 96 bits per code point is an upper bound.  So the range is 
> >more than 4 bytes per code point, say 6, to 12 bytes per code point. 
> >Unicode has 1,114,112 code points.  A straight forward implementation 
> >would require 7-14 MB.  But currently, a swath in the middle 11/16 of 
> >the space is unassigned, and the last 1/8 of the space is private use, 
> >so all the code points in there have the same properties as far as perl 
> >is concerned, and the 1/16 just before this private use space has only 
> >500 assigned points out of 65536.  So, by playing a few games, the 
> >memory could be cut by 14/16, or down to about 1-2 MB currently, using 
> >more space as more characters are assigned by Unicode.  The rumor is 
> >that Unicode 6.0 coming out this year will start to eat into that large 
> >swath.
> >
> I did a little more investigation, and found that the current need is 
> more like 200 bits/code point, excluding Name, Normalization, Block, and 
> a couple things that Perl doesn't currently do anything with.  And, 
> another 1/16 of the space can essentially be represented by one entry. 
> That comes out to about 3.5MB with a reasonably straight forward 
> implementation.

OK cool.

It's nice to know that bitmasks would work. I was actually assuming that it
would be possible to build inversion lists as C structures at perl compile
time, with a bsearch on them, to replace our current use of HVs to store
things. (I hope I'm not too wrong on that last part)

I wonder if there's a sweet spot, with some things stored in inversion lists,
and other things store with 1 (or more) bits per code point, possibly as
a set of disjoint tables to exploit the large gap.

However, even 7-14Mb doesn't scare me. On any demand paged OS, if it's not
used, it's not paged in. I doubt that we have any platform where we're that
tight on virtual memory space for executables.

(and this is where someone proves me wrong. An initial guess - Symbian)

Nicholas Clark

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About