develooper Front page | perl.perl5.porters | Postings from March 2010

Re: Another optimization question: bsearch()

Thread Previous | Thread Next
From:
karl williamson
Date:
March 3, 2010 21:39
Subject:
Re: Another optimization question: bsearch()
Message ID:
4B8F4782.20805@khwilliamson.com
karl williamson wrote:
> Nicholas Clark wrote:
>> On Mon, Mar 01, 2010 at 04:45:57PM -0700, karl williamson wrote:
>>> Andy Dougherty wrote:
>>>> On Mon, 1 Mar 2010, H.Merijn Brand wrote:
>>>>
>>>>> On Mon, 01 Mar 2010 11:43:27 -0700, karl williamson
>>>>> <public@khwilliamson.com> wrote:
>>>>>
>>>>>> Jarkko a long time ago that swashes and regex character classes 
>>>>>> should be replaced by inversion lists for efficiency.
>>>>>>
>>>>>> Inversion lists can be searched binarily.  bsearch() is a standard 
>>>>>> C library routine, but the only mention of it in Perl is in the POSIX 
>>
>>>> In short, if it makes sense to use it, I think you can go ahead and 
>>>> use it.
>>
>> That's my view too.
>>
>>> That made me curious.  I have an SVR2 manual from 1983, and it's in 
>>> there, with char*.  I think the void* and const constructs were added 
>>> after that.  They're not in K&R 1st edition, but are in the 2nd, 
>>> 1988. Anyway, I doubt that we'd need a Configure probe, and this is 
>>> not a short term project; I'm just thinking ahead about the 
>>> possibilities.
>>
>> Something I've been thinking for a while is that (to my mind) it would 
>> be good
>> if the Unicode structures for the regexp engine weren't run time 
>> loaded, and
>> stored in "Perl" data structures.
>>
>> If they were instead built as const static C data structures, much 
>> like the
>> tables used by Encode, then they could be linked into perl at build time.
>> This would mean
>>
>> 1: Zero load time
>> 2: (likely) faster runtime lookup
>> 3: implicitly shared between perl processes (on most OSes)
>>
>> Nicholas Clark
>>
> 
> I thought the reason for not doing this was memory size.
> 
> The book "Unicode Demystified" says that most Unicode properties can be 
> stored in 32 bits per code point (by using various tricks including 
> exception lists for out-of-the-ordinary characters).  But that was for 
> Unicode 3.2, and there have been quite a few more properties added since 
> then.  I think all of those are binary, so I guess that all things 
> considered 96 bits per code point is an upper bound.  So the range is 
> more than 4 bytes per code point, say 6, to 12 bytes per code point. 
> Unicode has 1,114,112 code points.  A straight forward implementation 
> would require 7-14 MB.  But currently, a swath in the middle 11/16 of 
> the space is unassigned, and the last 1/8 of the space is private use, 
> so all the code points in there have the same properties as far as perl 
> is concerned, and the 1/16 just before this private use space has only 
> 500 assigned points out of 65536.  So, by playing a few games, the 
> memory could be cut by 14/16, or down to about 1-2 MB currently, using 
> more space as more characters are assigned by Unicode.  The rumor is 
> that Unicode 6.0 coming out this year will start to eat into that large 
> swath.
> 
I did a little more investigation, and found that the current need is 
more like 200 bits/code point, excluding Name, Normalization, Block, and 
a couple things that Perl doesn't currently do anything with.  And, 
another 1/16 of the space can essentially be represented by one entry. 
That comes out to about 3.5MB with a reasonably straight forward 
implementation.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About