develooper Front page | perl.perl5.porters | Postings from March 2010

Re: Another optimization question: bsearch()

Thread Previous | Thread Next
karl williamson
March 3, 2010 08:46
Re: Another optimization question: bsearch()
Message ID:
Nicholas Clark wrote:
> On Mon, Mar 01, 2010 at 04:45:57PM -0700, karl williamson wrote:
>> Andy Dougherty wrote:
>>> On Mon, 1 Mar 2010, H.Merijn Brand wrote:
>>>> On Mon, 01 Mar 2010 11:43:27 -0700, karl williamson
>>>> <> wrote:
>>>>> Jarkko a long time ago that swashes and regex character classes should 
>>>>> be replaced by inversion lists for efficiency.
>>>>> Inversion lists can be searched binarily.  bsearch() is a standard C 
>>>>> library routine, but the only mention of it in Perl is in the POSIX 
>>> In short, if it makes sense to use it, I think you can go ahead and use 
>>> it.
> That's my view too.
>> That made me curious.  I have an SVR2 manual from 1983, and it's in 
>> there, with char*.  I think the void* and const constructs were added 
>> after that.  They're not in K&R 1st edition, but are in the 2nd, 1988. 
>> Anyway, I doubt that we'd need a Configure probe, and this is not a 
>> short term project; I'm just thinking ahead about the possibilities.
> Something I've been thinking for a while is that (to my mind) it would be good
> if the Unicode structures for the regexp engine weren't run time loaded, and
> stored in "Perl" data structures.
> If they were instead built as const static C data structures, much like the
> tables used by Encode, then they could be linked into perl at build time.
> This would mean
> 1: Zero load time
> 2: (likely) faster runtime lookup
> 3: implicitly shared between perl processes (on most OSes)
> Nicholas Clark

I thought the reason for not doing this was memory size.

The book "Unicode Demystified" says that most Unicode properties can be 
stored in 32 bits per code point (by using various tricks including 
exception lists for out-of-the-ordinary characters).  But that was for 
Unicode 3.2, and there have been quite a few more properties added since 
then.  I think all of those are binary, so I guess that all things 
considered 96 bits per code point is an upper bound.  So the range is 
more than 4 bytes per code point, say 6, to 12 bytes per code point. 
Unicode has 1,114,112 code points.  A straight forward implementation 
would require 7-14 MB.  But currently, a swath in the middle 11/16 of 
the space is unassigned, and the last 1/8 of the space is private use, 
so all the code points in there have the same properties as far as perl 
is concerned, and the 1/16 just before this private use space has only 
500 assigned points out of 65536.  So, by playing a few games, the 
memory could be cut by 14/16, or down to about 1-2 MB currently, using 
more space as more characters are assigned by Unicode.  The rumor is 
that Unicode 6.0 coming out this year will start to eat into that large 

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About