With the latest changes to charnames, the last major bit of Unicode
\N{name} functionality that is still missing is loose matching of names.
I've brought up the issue before, but no conclusions were reached,
other than that a trie approach didn't work.
Now I have more concrete data. I have implemented loose matching with
the current data structures, and it slows down the name lookups by a
factor of 2-3.
The problem is that there are an awful lot of unpredictable names in
Unicode, 23K+ in draft 6.0, up from 21K- in 5.2. The current
implementation places them all in one large string, which is 786K in
6.0. (And this is after the recent patch which removed 1000 predictable
names from the string, which are now found algorithmically. There are
several hundred thousand more predictable names in Unicode which are
also found algorithmically.)
The string is shared by the two inverse functions:
1) looking up the ordinal given a name (\N{} at compile time;
vianame() at runtime);
2) and looking up the name given the ordinal (viacode(), only at
runtime).
I presume that's why it is done the way it is: to avoid having two
large data structures. And the string is searched linearly when looking
something up.
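
To make the current arrangement concrete, here is a minimal sketch of one
string serving both directions of lookup. It is written in Python purely
for brevity and is not the charnames code. The record layout (hex ordinal,
tab, name, newline) and the four sample entries are assumptions made for
illustration, and the function names merely echo vianame and viacode.

```python
# Hypothetical miniature of the shared table: one "HEX<TAB>NAME" record
# per line. The layout and the entries are assumptions for illustration.
NAMES = (
    "00041\tLATIN CAPITAL LETTER A\n"
    "00042\tLATIN CAPITAL LETTER B\n"
    "000E9\tLATIN SMALL LETTER E WITH ACUTE\n"
    "02603\tSNOWMAN\n"
)


def vianame(name):
    """Name -> ordinal: linear scan for "<TAB>NAME<NEWLINE>"."""
    pos = NAMES.find("\t" + name + "\n")
    if pos < 0:
        return None
    # The hex field runs from the start of this record up to the tab.
    start = NAMES.rfind("\n", 0, pos) + 1
    return int(NAMES[start:pos], 16)


def viacode(code):
    """Ordinal -> name: linear scan for "HEX<TAB>" at the start of a record."""
    key = "%05X\t" % code
    if NAMES.startswith(key):
        start = len(key)
    else:
        pos = NAMES.find("\n" + key)
        if pos < 0:
            return None
        start = pos + 1 + len(key)
    end = NAMES.index("\n", start)
    return NAMES[start:end]
```

An exact-match lookup like this is a plain substring search. A loose
lookup cannot rely on that, because the spelling the user typed need not
appear verbatim anywhere in the string.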
Here are what I think the options are:
1) Don't implement loose matching. This doesn't appeal to me, since my
main goal is to make Perl more Unicode friendly.
2) Accept the slowdown, reasoning that even if it's big, it's
acceptable. When names are found, they are currently cached, so the
performance penalty is incurred only the first time a name is looked up.
And likely, most programs aren't going to look up huge numbers of
names. But if this later turns out to have been a bad decision, we
can't revert it without breaking backward compatibility.
3) Add a new parameter to the pragma: "use charnames ':loose'", which
allows the user to consciously select loose matching if they decide the
slowdown is acceptable. If we later speed things up, this would become
a no-op with no backward compatibility issues.
4) Split the data structures into two: one for each direction of lookup.
Each would be loaded only if needed, so there would be no space
penalty for programs that look up only in one direction. In fact, the
string used by \N{} and vianame would be about 6% smaller than
currently, as it would have spaces and dashes squeezed out. It is my
gut feeling that viacode is rarely used anyway. If we did this, then
the strings could be converted to very large hashes and performance
would zoom. I don't know what the overhead of such a hash would be.
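
Option 4 might look roughly like the following sketch, again in Python
only for brevity. Everything here is hypothetical: the sample table, the
squeeze() helper, and the lazy-loading scheme are illustrations, not
existing code. The squeezing is also deliberately simplified. It drops
every space, underscore and hyphen, whereas the Unicode loose-matching
rule has exceptions (for example, it ignores only medial hyphens), so a
real implementation would need more care to avoid collisions.

```python
import re

# Hypothetical sample data: (ordinal, official name) pairs.
RAW = [
    (0x0041, "LATIN CAPITAL LETTER A"),
    (0x002D, "HYPHEN-MINUS"),
    (0x200B, "ZERO WIDTH SPACE"),
    (0x2603, "SNOWMAN"),
]


def squeeze(name):
    """Simplified loose key: uppercase, with spaces, '_' and '-' removed."""
    return re.sub(r"[\s_-]", "", name).upper()


# One table per direction, each built only on first use.
_by_name = None
_by_code = None


def vianame_loose(name):
    """Name -> ordinal through a hash keyed on the squeezed name."""
    global _by_name
    if _by_name is None:
        _by_name = {squeeze(n): c for c, n in RAW}
    return _by_name.get(squeeze(name))


def viacode(code):
    """Ordinal -> official name through a separate hash."""
    global _by_code
    if _by_code is None:
        _by_code = dict(RAW)
    return _by_code.get(code)
```

With this shape, a program that only ever goes from name to ordinal never
builds the reverse table, and a loose lookup costs one normalization of
the input plus one hash probe rather than a scan. What the sketch cannot
show is the memory overhead of such a hash at full Unicode scale, which
remains the open question.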
I would like to hear what people's opinions are.