Front page | perl.perl5.porters |
Postings from March 2010
Re: Another optimization question: bsearch()
Thread Previous
|
Thread Next
From:
karl williamson
Date:
March 3, 2010 21:39
Subject:
Re: Another optimization question: bsearch()
Message ID:
4B8F4782.20805@khwilliamson.com
karl williamson wrote:
> Nicholas Clark wrote:
>> On Mon, Mar 01, 2010 at 04:45:57PM -0700, karl williamson wrote:
>>> Andy Dougherty wrote:
>>>> On Mon, 1 Mar 2010, H.Merijn Brand wrote:
>>>>
>>>>> On Mon, 01 Mar 2010 11:43:27 -0700, karl williamson
>>>>> <public@khwilliamson.com> wrote:
>>>>>
>>>>>> Jarkko a long time ago that swashes and regex character classes
>>>>>> should be replaced by inversion lists for efficiency.
>>>>>>
>>>>>> Inversion lists can be searched binarily. bsearch() is a standard
>>>>>> C library routine, but the only mention of it in Perl is in the POSIX
>>
>>>> In short, if it makes sense to use it, I think you can go ahead and
>>>> use it.
>>
>> That's my view too.
>>
>>> That made me curious. I have an SVR2 manual from 1983, and it's in
>>> there, with char*. I think the void* and const constructs were added
>>> after that. They're not in K&R 1st edition, but are in the 2nd,
>>> 1988. Anyway, I doubt that we'd need a Configure probe, and this is
>>> not a short term project; I'm just thinking ahead about the
>>> possibilities.
>>
>> Something I've been thinking for a while is that (to my mind) it would
>> be good
>> if the Unicode structures for the regexp engine weren't run time
>> loaded, and
>> stored in "Perl" data structures.
>>
>> If they were instead built as const static C data structures, much
>> like the
>> tables used by Encode, then they could be linked into perl at build time.
>> This would mean
>>
>> 1: Zero load time
>> 2: (likely) faster runtime lookup
>> 3: implicitly shared between perl processes (on most OSes)
>>
>> Nicholas Clark
>>
>
> I thought the reason for not doing this was memory size.
>
> The book "Unicode Demystified" says that most Unicode properties can be
> stored in 32 bits per code point (by using various tricks including
> exception lists for out-of-the-ordinary characters). But that was for
> Unicode 3.2, and there have been quite a few more properties added since
> then. I think all of those are binary, so I guess that all things
> considered 96 bits per code point is an upper bound. So the range is
> more than 4 bytes per code point, say 6, to 12 bytes per code point.
> Unicode has 1,114,112 code points. A straight forward implementation
> would require 7-14 MB. But currently, a swath in the middle 11/16 of
> the space is unassigned, and the last 1/8 of the space is private use,
> so all the code points in there have the same properties as far as perl
> is concerned, and the 1/16 just before this private use space has only
> 500 assigned points out of 65536. So, by playing a few games, the
> memory could be cut by 14/16, or down to about 1-2 MB currently, using
> more space as more characters are assigned by Unicode. The rumor is
> that Unicode 6.0 coming out this year will start to eat into that large
> swath.
>
I did a little more investigation, and found that the current need is
more like 200 bits/code point, excluding Name, Normalization, Block, and
a couple things that Perl doesn't currently do anything with. And,
another 1/16 of the space can essentially be represented by one entry.
That comes out to about 3.5MB with a reasonably straight forward
implementation.
Thread Previous
|
Thread Next