Front page | perl.perl5.porters |
Postings from March 2010
Re: Another optimization question: bsearch()
Thread Previous
|
Thread Next
From:
Nicholas Clark
Date:
March 8, 2010 09:05
Subject:
Re: Another optimization question: bsearch()
Message ID:
20100308170531.GL28880@plum.flirble.org
On Wed, Mar 03, 2010 at 10:39:14PM -0700, karl williamson wrote:
> karl williamson wrote:
> >Nicholas Clark wrote:
> >>If they were instead built as const static C data structures, much
> >>like the
> >>tables used by Encode, then they could be linked into perl at build time.
> >>This would mean
> >>
> >>1: Zero load time
> >>2: (likely) faster runtime lookup
> >>3: implicitly shared between perl processes (on most OSes)
> >>
> >>Nicholas Clark
> >>
> >
> >I thought the reason for not doing this was memory size.
> >
> >The book "Unicode Demystified" says that most Unicode properties can be
> >stored in 32 bits per code point (by using various tricks including
> >exception lists for out-of-the-ordinary characters). But that was for
> >Unicode 3.2, and there have been quite a few more properties added since
> >then. I think all of those are binary, so I guess that all things
> >considered 96 bits per code point is an upper bound. So the range is
> >more than 4 bytes per code point, say 6, to 12 bytes per code point.
> >Unicode has 1,114,112 code points. A straight forward implementation
> >would require 7-14 MB. But currently, a swath in the middle 11/16 of
> >the space is unassigned, and the last 1/8 of the space is private use,
> >so all the code points in there have the same properties as far as perl
> >is concerned, and the 1/16 just before this private use space has only
> >500 assigned points out of 65536. So, by playing a few games, the
> >memory could be cut by 14/16, or down to about 1-2 MB currently, using
> >more space as more characters are assigned by Unicode. The rumor is
> >that Unicode 6.0 coming out this year will start to eat into that large
> >swath.
> >
> I did a little more investigation, and found that the current need is
> more like 200 bits/code point, excluding Name, Normalization, Block, and
> a couple things that Perl doesn't currently do anything with. And,
> another 1/16 of the space can essentially be represented by one entry.
> That comes out to about 3.5MB with a reasonably straight forward
> implementation.
OK cool.
It's nice to know that bitmasks would work. I was actually assuming that it
would be possible to build inversion lists as C structures at perl compile
time, with a bsearch on them, to replace our current use of HVs to store
things. (I hope I'm not too wrong on that last part)
I wonder if there's a sweet spot, with some things stored in inversion lists,
and other things store with 1 (or more) bits per code point, possibly as
a set of disjoint tables to exploit the large gap.
However, even 7-14Mb doesn't scare me. On any demand paged OS, if it's not
used, it's not paged in. I doubt that we have any platform where we're that
tight on virtual memory space for executables.
(and this is where someone proves me wrong. An initial guess - Symbian)
Nicholas Clark
Thread Previous
|
Thread Next