perl.perl5.porters | Postings from July 2010
Re: RFC: space vs. time vs. functionality in \N{name} loose matching
From: karl williamson
Date: July 29, 2010 09:08
Subject: Re: RFC: space vs. time vs. functionality in \N{name} loose matching
Message ID: 4C51A785.1050302@khwilliamson.com
David Golden wrote:
> I don't understand what you mean by loose matching. Can you elaborate,
> please?
Concisely, it is essentially the equivalent of a Unix 'diff -i -w', but
with most hyphens considered to be white space as well. Currently, when
you say \N{LATIN CAPITAL LETTER A}, you must get the spacing exactly
right and use all caps with no dashes for it to be understood. Loose
matching would have, to use an extreme example, \N{La t in Capital
letter-a} mean the same thing.
It may not matter so much on names like this one, but there are Unicode
names that have dashes and spaces in irregular combinations like a
twisty maze of little passages, or is it a maze of twisty little
passages? Unicode names weren't really well thought out, most everyone
agrees, and loose matching keeps you from having to remember if this one
has a dash where that one has a space, or no space.
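To make the rule concrete, here is a sketch (in Python, since the rule
itself is language-independent) of the key normalization that loose
matching implies: fold case, ignore whitespace and underscores, and
ignore medial hyphens. Unicode's published loose-matching rule has one
known exception, HANGUL JUNGSEONG O-E, which must keep its hyphen so it
does not collide with HANGUL JUNGSEONG OE; the function below is a
sketch of the idea, not the charnames implementation.

```python
import re

def loose_key(name):
    """Reduce a Unicode character name to a loose-matching key:
    fold to upper case, drop whitespace and underscores, and drop
    medial hyphens (a hyphen with a letter or digit on each side).
    The one published exception, HANGUL JUNGSEONG O-E, keeps its
    hyphen so it stays distinct from HANGUL JUNGSEONG OE."""
    folded = name.upper()
    if folded.replace(' ', '') == 'HANGULJUNGSEONGO-E':
        return 'HANGULJUNGSEONGO-E'
    folded = re.sub(r'[\s_]+', '', folded)
    return re.sub(r'(?<=\w)-(?=\w)', '', folded)

# Both of these reduce to the same key, LATINCAPITALLETTERA:
print(loose_key('LATIN CAPITAL LETTER A'))
print(loose_key('La t in Capital letter-a'))
```

With keys like these, the twisty dash-or-space question above simply
disappears: every spelling of a name collapses to one canonical key.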
>
> David
>
> On Jul 29, 2010 10:57 AM, "karl williamson" <public@khwilliamson.com
> <mailto:public@khwilliamson.com>> wrote:
> > With the latest changes to charnames, the last major bit of Unicode
> > \N{name} functionality that is still missing is loose matching of names.
> > I've brought up the issue before, but no conclusions were reached,
> > other than a trie approach didn't work.
> >
> > Now I have more concrete data. I have implemented loose matching with
> > the current data structures, and it slows down the name lookups by a
> > factor of 2-3.
> >
> > The problem is that there are an awful lot of unpredictable names in
> > Unicode, 23K+ in draft 6.0, up from 21K- in 5.2. The current
> > implementation places them all in one large string, which is 786K in
> > 6.0. (And this is after the recent patch which removed 1000 predictable
> > names from the string, which are now found algorithmically. There are
> > several hundred thousand more predictable names in Unicode which are
> > also found algorithmically.)
> >
> > The string is shared by the two inverse functions:
> > 1) looking up the ordinal given a name (\N{} at compile time,
> > vianame() at runtime);
> > 2) looking up the name given the ordinal (viacode(), runtime
> > only).
> >
> > I presume that's why it is done the way it is: to avoid having
> > two large data structures. And the string is searched linearly when
> > looking something up.
> >
> > Here are what I think the options are:
> >
> > 1) Don't implement loose matching. This doesn't appeal to me, as
> > my main goal is to make Perl more Unicode friendly.
> >
> > 2) Accept the slowdown, reasoning that even if it's big, it's
> > acceptable. When names are found, they are currently cached, so the
> > performance penalty is incurred only the first time a name is looked up.
> > And likely, most programs aren't going to look up huge numbers of
> > names. But if this later turns out to have been a bad decision, we
> > can't revert it without breaking backward compatibility.
> >
> > 3) Add a new parameter to the pragma: "use charnames ':loose'", which
> > allows the user to consciously select loose matching if they decide the
> > slowdown is acceptable. If we later speed things up, this would become
> > a no-op with no backwards compatibility issues.
> >
> > 4) Split the data structures into two: one for each direction of lookup.
> > Each would be loaded only if needed, so there would be no space
> > penalty for programs that look up only in one direction. In fact, the
> > string used by \N{} and vianame would be about 6% smaller than
> > currently, as it would have spaces and dashes squeezed out. It is my
> > gut feeling that viacode is rarely used anyway. If we did this, then
> > the strings could be converted to very large hashes and performance
> > would zoom. I don't know what the overhead of such a hash would be.
> >
> > I would like to hear what people's opinions are.
>
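For readers weighing option 4 above, here is a minimal sketch (Python,
with a two-entry sample table standing in for the 23K+ real names) of
what splitting the shared string into two direction-specific structures
could look like. The names and helper here are hypothetical
illustrations, not the charnames internals:

```python
# Option 4, sketched: two independent structures, one per lookup
# direction, so a program pays only for the direction it uses.
# The sample table is hypothetical; the real one has 23K+ entries.
sample = {
    0x0041: 'LATIN CAPITAL LETTER A',
    0x00C5: 'LATIN CAPITAL LETTER A WITH RING ABOVE',
}

def squeeze(name):
    """Fold case and squeeze out spaces and hyphens (simplified:
    a real version would keep non-medial hyphens)."""
    return name.upper().replace(' ', '').replace('-', '')

# Direction 1: name -> ordinal (what \N{} and vianame() need).
# Storing pre-squeezed keys is why this copy can be ~6% smaller
# than the shared string, and a hash lookup replaces the linear
# scan, so performance "zooms".
by_name = {squeeze(n): cp for cp, n in sample.items()}

# Direction 2: ordinal -> name (what viacode() needs); under
# option 4 this would be loaded only if viacode() is ever called.
by_code = dict(sample)

print(hex(by_name[squeeze('Latin Capital Letter A')]))
print(by_code[0x00C5])
```

Note that squeezing keys in the name-to-ordinal table gives loose
matching essentially for free, which is why options 3 and 4 combine
naturally.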