Front page | perl.perl5.porters |
Postings from March 2010
Re: Another optimization question: bsearch()
Thread Previous
|
Thread Next
From:
karl williamson
Date:
March 3, 2010 08:46
Subject:
Re: Another optimization question: bsearch()
Message ID:
4B8E9237.6020601@khwilliamson.com
Nicholas Clark wrote:
> On Mon, Mar 01, 2010 at 04:45:57PM -0700, karl williamson wrote:
>> Andy Dougherty wrote:
>>> On Mon, 1 Mar 2010, H.Merijn Brand wrote:
>>>
>>>> On Mon, 01 Mar 2010 11:43:27 -0700, karl williamson
>>>> <public@khwilliamson.com> wrote:
>>>>
>>>>> Jarkko a long time ago that swashes and regex character classes should
>>>>> be replaced by inversion lists for efficiency.
>>>>>
>>>>> Inversion lists can be searched binarily. bsearch() is a standard C
>>>>> library routine, but the only mention of it in Perl is in the POSIX
>
>>> In short, if it makes sense to use it, I think you can go ahead and use
>>> it.
>
> That's my view too.
>
>> That made me curious. I have an SVR2 manual from 1983, and it's in
>> there, with char*. I think the void* and const constructs were added
>> after that. They're not in K&R 1st edition, but are in the 2nd, 1988.
>> Anyway, I doubt that we'd need a Configure probe, and this is not a
>> short term project; I'm just thinking ahead about the possibilities.
>
> Something I've been thinking for a while is that (to my mind) it would be good
> if the Unicode structures for the regexp engine weren't run time loaded, and
> stored in "Perl" data structures.
>
> If they were instead built as const static C data structures, much like the
> tables used by Encode, then they could be linked into perl at build time.
> This would mean
>
> 1: Zero load time
> 2: (likely) faster runtime lookup
> 3: implicitly shared between perl processes (on most OSes)
>
> Nicholas Clark
>
I thought the reason for not doing this was memory size.
The book "Unicode Demystified" says that most Unicode properties can be
stored in 32 bits per code point (by using various tricks including
exception lists for out-of-the-ordinary characters). But that was for
Unicode 3.2, and there have been quite a few more properties added since
then. I think all of those are binary, so I guess that all things
considered 96 bits per code point is an upper bound. So the range is
more than 4 bytes per code point, say 6, to 12 bytes per code point.
Unicode has 1,114,112 code points. A straight forward implementation
would require 7-14 MB. But currently, a swath in the middle 11/16 of
the space is unassigned, and the last 1/8 of the space is private use,
so all the code points in there have the same properties as far as perl
is concerned, and the 1/16 just before this private use space has only
500 assigned points out of 65536. So, by playing a few games, the
memory could be cut by 14/16, or down to about 1-2 MB currently, using
more space as more characters are assigned by Unicode. The rumor is
that Unicode 6.0 coming out this year will start to eat into that large
swath.
Thread Previous
|
Thread Next