Re: RFC: space vs. time vs. functionality in \N{name} loose matching
From: David Golden
Date: July 29, 2010 08:06
Subject: Re: RFC: space vs. time vs. functionality in \N{name} loose matching
Message ID: AANLkTi=2vP=_b4eBfx4gHx4X_6_HyukBfFhKwsUvVgz4@mail.gmail.com
I don't understand what you mean by loose matching. Can you elaborate,
please?
David
On Jul 29, 2010 10:57 AM, "karl williamson" <public@khwilliamson.com> wrote:
> With the latest changes to charnames, the last major bit of Unicode
> \N{name} functionality that is still missing is loose matching of names.
> I've brought up the issue before, but no conclusions were reached,
> other than that a trie approach didn't work.
>
> Now I have more concrete data. I have implemented loose matching with
> the current data structures, and it slows down the name lookups by a
> factor of 2-3.
>
> The problem is that there are an awful lot of unpredictable names in
> Unicode, more than 23K in draft 6.0, up from just under 21K in 5.2. The current
> implementation places them all in one large string, which is 786K in
> 6.0. (And this is after the recent patch which removed 1000 predictable
> names from the string, which are now found algorithmically. There are
> several hundred thousand more predictable names in Unicode which are
> also found algorithmically.)
>
> The string is shared by the two inverse functions:
> 1) looking up the ordinal given a name (\N{} at compile time;
> vianame() at runtime);
> 2) and looking up the name given the ordinal (viacode(), only at
> runtime).
>
> I presume that's why it is done the way it is: to avoid having two
> large data structures. And the string is searched linearly when looking
> something up.
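
To make the shared-string arrangement concrete, here is a minimal Python sketch of one string serving both lookup directions by linear scan. The record layout (hex code, tab, name, newline) and the four sample entries are assumptions chosen for illustration, not a copy of the real charnames table; only the function names vianame and viacode come from the post.

```python
# Hypothetical layout: one "HEXCODE\tNAME\n" record per character, all held
# in a single string. A leading "\n" lets every record be matched the same way.
NAMES = (
    "\n"
    "00041\tLATIN CAPITAL LETTER A\n"
    "000E9\tLATIN SMALL LETTER E WITH ACUTE\n"
    "003A9\tGREEK CAPITAL LETTER OMEGA\n"
    "0263A\tWHITE SMILING FACE\n"
)


def vianame(name):
    """Name -> ordinal: linear scan of the string for "\\tNAME\\n"."""
    pos = NAMES.find("\t" + name + "\n")
    if pos < 0:
        return None
    # Walk back to the start of the record to pick up the hex code.
    start = NAMES.rfind("\n", 0, pos) + 1
    return int(NAMES[start:pos], 16)


def viacode(ordinal):
    """Ordinal -> name: linear scan of the same string for "\\nHEXCODE\\t"."""
    key = "\n%05X\t" % ordinal
    pos = NAMES.find(key)
    if pos < 0:
        return None
    start = pos + len(key)
    end = NAMES.index("\n", start)
    return NAMES[start:end]
```

Both directions reuse the same data, so only one large structure is needed, but every lookup that misses the cache pays for a scan of the string.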
>
> Here are what I think the options are:
>
> 1) Don't implement loose matching. This doesn't appeal to me, whose
> main goal is to make Perl more Unicode friendly.
>
> 2) Accept the slowdown, reasoning that even if it's big, it's
> acceptable. When names are found, they are currently cached, so the
> performance penalty is incurred only the first time a name is looked up.
> And likely, most programs aren't going to look up huge numbers of
> names. But if this later turns out to have been a bad decision, we
> can't revert it without breaking backward compatibility.
>
> 3) Add a new parameter to the pragma: "use charnames ':loose'", which
> allows the user to consciously select loose matching if they decide the
> slowdown is acceptable. If we later speed things up, this would become
> a no-op with no backwards compatibility issues.
>
> 4) Split the data structures into two: one for each direction of lookup.
> Each would be loaded only if needed, so there would be no space
> penalty for programs that look up only in one direction. In fact, the
> string used by \N{} and vianame would be about 6% smaller than
> currently, as it would have spaces and dashes squeezed out. It is my
> gut feeling that viacode is rarely used anyway. If we did this, then
> the strings could be converted to very large hashes and performance
> would zoom. I don't know what the overhead of such a hash would be.
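
As a rough illustration of option 4, the Python sketch below builds a one-direction hash keyed on a squeezed form of each name, so that a loosely written name becomes a single hash lookup. The helpers loose_key and vianame_loose and the table BY_LOOSE_NAME are hypothetical names. The normalization is a simplification in the spirit of the Unicode loose-matching rule for character names (ignore case, whitespace, underscores, and hyphens between two alphanumerics); it does not handle that rule's special cases. Python's unicodedata module stands in for the real name table, limited to a small range to keep the example short.

```python
import re
import unicodedata


def loose_key(name):
    """Squeeze a name to its loose-matching key (simplified, see lead-in)."""
    # Drop medial hyphens first, while spaces still mark word boundaries,
    # so that "LETTER -A" keeps its hyphen but "HYPHEN-MINUS" loses it.
    name = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    # Then drop whitespace and underscores, and fold case.
    return re.sub(r"[\s_]+", "", name).upper()


# Hypothetical name -> ordinal table for U+0020..U+024F only.
BY_LOOSE_NAME = {}
for cp in range(0x20, 0x250):
    name = unicodedata.name(chr(cp), None)  # None for unnamed code points
    if name is not None:
        BY_LOOSE_NAME[loose_key(name)] = cp


def vianame_loose(name):
    """Loose name -> ordinal via one hash lookup; None if not found."""
    return BY_LOOSE_NAME.get(loose_key(name))
```

With this table, "latin small letter a", "Latin_Small_Letter_A" and "LATIN SMALL LETTER A" all resolve to the same ordinal. The reverse direction (ordinal to name) would need its own structure, which is the split the option describes.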
>
> I would like to hear what people's opinions are.