perl.perl5.porters | Postings from July 2010

Re: RFC: space vs. time vs. functionality in \N{name} loose matching

karl williamson
July 29, 2010 09:08
David Golden wrote:
> I don't understand what you mean by loose matching.  Can you elaborate, 
> please?

Concisely, it is essentially the equivalent of a Unix 'diff -i -w', but 
with most hyphens considered to be white space as well.  Currently, when 
you say \N{LATIN CAPITAL LETTER A}, you must get the spacing 
exactly right and use all caps with no dashes for it to be understood. 
Loose matching would have, to use an extreme example, \N{La t in Capital 
letter-a} mean the same thing.
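
To make that concrete, here is a rough sketch of the folding, written in 
Python purely for illustration; it is not how charnames does or would do 
it.  The helper name loose_key is made up, and it simplifies by dropping 
every hyphen, whereas the real rule would treat only most hyphens as 
white space.

```python
import re

def loose_key(name: str) -> str:
    """Fold a character name for loose comparison (hypothetical helper).

    Assumption: ALL white space and ALL hyphens are dropped, then the
    result is upper-cased. A real implementation would have to keep
    the few hyphens that distinguish one name from another.
    """
    return re.sub(r"[\s\-]+", "", name).upper()

# The extreme example and the exact name fold to the same key.
same = loose_key("La t in Capital letter-a") == loose_key("LATIN CAPITAL LETTER A")
```

Two spellings name the same character exactly when their folded keys 
are equal, so `same` is true here.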

It may not matter so much on names like this one, but there are Unicode 
names that have dashes and spaces in irregular combinations like a 
twisty maze of little passages, or is it a maze of twisty little 
passages?  Unicode names weren't really well thought out, most everyone 
agrees, and loose matching keeps you from having to remember if this one 
has a dash where that one has a space, or no space.
> David
> On Jul 29, 2010 10:57 AM, "karl williamson" wrote:
>  > With the latest changes to charnames, the last major bit of Unicode
>  > \N{name} functionality that is still missing is loose matching of names.
>  > I've brought up the issue before, but no conclusions were reached,
>  > other than a trie approach didn't work.
>  >
>  > Now I have more concrete data. I have implemented loose matching with
>  > the current data structures, and it slows down the name lookups by a
>  > factor of 2-3.
>  >
>  > The problem is that there are an awful lot of unpredictable names in
>  > Unicode, 23K+ in draft 6.0, up from 21K- in 5.2. The current
>  > implementation places them all in one large string, which is 786K in
>  > 6.0. (And this is after the recent patch which removed 1000 predictable
>  > names from the string, which are now found algorithmically. There are
>  > several hundred thousand more predictable names in Unicode which are
>  > also found algorithmically.)
>  >
>  > The string is shared by the two inverse functions:
>  > 1) looking up the ordinal given a name (\N{} at compile time;
>  > vianame() at runtime);
>  > 2) and looking up the name given the ordinal (viacode, only at
>  > runtime).
>  >
>  > I presume that's why it is done the way it is; to not have to have two
>  > large data structures. And the string is searched linearly when looking
>  > something up.
>  >
>  > Here are what I think the options are:
>  >
>  > 1) Don't implement loose matching. This doesn't appeal to me, whose
>  > main goal is to make Perl more Unicode friendly.
>  >
>  > 2) Accept the slowdown, reasoning that even if it's big, it's
>  > acceptable. When names are found, they are currently cached, so the
>  > performance penalty is incurred only the first time a name is looked up.
>  > And likely, most programs aren't going to look up huge numbers of
>  > names. But if this later turns out to have been a bad decision, we
>  > can't revert it without breaking backward compatibility.
>  >
>  > 3) Add a new parameter to the pragma: "use charnames ':loose'", which
>  > allows the user to consciously select loose matching if they decide the
>  > slowdown is acceptable. If we later speed things up, this would become
>  > a no-op with no backwards compatibility issues.
>  >
>  > 4) Split the data structures into two: one for each direction of lookup.
>  > Each would be loaded only if needed, so there would be no space
>  > penalty for programs that look up only in one direction. In fact, the
>  > string used by \N{} and vianame would be about 6% smaller than
>  > currently, as it would have spaces and dashes squeezed out. It is my
>  > gut feeling that viacode is rarely used anyway. If we did this, then
>  > the strings could be converted to very large hashes and performance
>  > would zoom. I don't know what the overhead of such a hash would be.
>  >
>  > I would like to hear what people's opinions are.
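
For anyone trying to picture option 4, here is a minimal sketch of the 
two-table idea, again in Python for illustration only and not the 
charnames code.  The three-entry table stands in for the 23K+ real 
names, and vianame_loose, viacode and squeeze are hypothetical stand-ins 
that drop all spaces and hyphens.

```python
import re

# Tiny hand-picked sample; the real table would hold 23K+ names.
NAMES = {
    0x0041: "LATIN CAPITAL LETTER A",
    0x002D: "HYPHEN-MINUS",
    0xFEFF: "ZERO WIDTH NO-BREAK SPACE",
}

def squeeze(name):
    """Drop white space and hyphens, then upper-case (simplified rule)."""
    return re.sub(r"[\s\-]+", "", name).upper()

# Direction 1: squeezed name -> ordinal. Built only for name lookups.
BY_LOOSE_NAME = {squeeze(n): cp for cp, n in NAMES.items()}

def vianame_loose(name):
    """Return the ordinal for a loosely spelled name, or None."""
    return BY_LOOSE_NAME.get(squeeze(name))

def viacode(cp):
    """Direction 2: ordinal -> official name, or None."""
    return NAMES.get(cp)
```

With the keys squeezed ahead of time, a loose lookup is one hash probe 
instead of a linear scan of the string; the cost is the second table 
and whatever per-entry overhead the hash carries.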
