RFC: space vs. time vs. functionality in \N{name} loose matching

From: karl williamson
Date: July 29, 2010 07:56
Subject: RFC: space vs. time vs. functionality in \N{name} loose matching
Message ID: 4C51967E.8090907@khwilliamson.com

With the latest changes to charnames, the last major bit of Unicode
\N{name} functionality that is still missing is loose matching of names.
I've brought up the issue before, but no conclusions were reached,
other than that a trie approach didn't work.
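
For readers unfamiliar with the term, loose matching means that
differences in case, spaces, and most hyphens are ignored when comparing
names.  A minimal sketch of the idea, assuming a simplified rule that
drops every space, underscore, and hyphen (the actual Unicode rule only
ignores medial hyphens, with one exception).  The helper name loose_key
is hypothetical and is not part of charnames:

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a character name to a canonical key so that
# names differing only in case, spaces, underscores, or hyphens compare
# equal.  Simplification: this removes ALL hyphens, unlike the real rule.
sub loose_key {
    my ($name) = @_;
    my $key = uc $name;    # ignore case
    $key =~ tr/ _-//d;     # squeeze out spaces, underscores, hyphens
    return $key;
}

print loose_key("Latin small-letter A"), "\n";
```

Two spellings then match when their keys are equal, so the cost of loose
matching depends on how cheaply the stored names can be compared in this
squeezed form.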

Now I have more concrete data.  I have implemented loose matching with 
the current data structures, and it slows down the name lookups by a 
factor of 2-3.

The problem is that there are an awful lot of unpredictable names in
Unicode: more than 23K in draft 6.0, up from just under 21K in 5.2.  The
current implementation places them all in one large string, which is
786K in 6.0.  (And this is after the recent patch that removed 1000
predictable names from the string; those are now found algorithmically.
There are several hundred thousand more predictable names in Unicode
that are also found algorithmically.)
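
To illustrate what "found algorithmically" means: some names are fully
determined by the code point, so they never need to be stored.  The
sketch below uses the CJK unified ideograph naming rule (a fixed prefix
followed by the hex code point).  The sub names are hypothetical, and
the range is only the basic CJK block, chosen for illustration; the
exact assigned ranges differ between Unicode versions.

```perl
use strict;
use warnings;

# Illustrative range only: the basic CJK Unified Ideographs block.
# The set of assigned code points differs between Unicode versions.
my ($CJK_FIRST, $CJK_LAST) = (0x4E00, 0x9FFF);
my $CJK_PREFIX = "CJK UNIFIED IDEOGRAPH-";

# ordinal -> name, or undef if the ordinal is outside the range
sub algorithmic_name {
    my ($ord) = @_;
    return if $ord < $CJK_FIRST || $ord > $CJK_LAST;
    return sprintf "%s%04X", $CJK_PREFIX, $ord;
}

# name -> ordinal, or undef if the name lacks the predictable form
sub algorithmic_ord {
    my ($name) = @_;
    return unless $name =~ /\A\Q$CJK_PREFIX\E([0-9A-F]{4,5})\z/;
    my $ord = hex($1);
    return if $ord < $CJK_FIRST || $ord > $CJK_LAST;
    return $ord;
}
```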

The string is shared by the two inverse functions:
     1) looking up the ordinal given a name (\N{} at compile time;
vianame() at runtime);
     2) and looking up the name given the ordinal (viacode(), only at
runtime).

I presume that's why it is done the way it is: to avoid having two
large data structures.  And the string is searched linearly when looking
something up.
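
A rough sketch of how one string can serve both directions with a
linear scan.  The "HEX<TAB>NAME<NEWLINE>" record layout, the three
sample records, and the sub names are assumptions made for this
illustration, not a description of the actual charnames internals:

```perl
use strict;
use warnings;

# Tiny stand-in for the one large string; the record layout
# "HEX<TAB>NAME<NEWLINE>" is assumed for this sketch.
my $names =
      "00041\tLATIN CAPITAL LETTER A\n"
    . "00061\tLATIN SMALL LETTER A\n"
    . "003B1\tGREEK SMALL LETTER ALPHA\n";

# name -> ordinal: scan the string for "<TAB>NAME" at the end of a line
sub name_to_ord {
    my ($name) = @_;
    return $names =~ /^([0-9A-F]+)\t\Q$name\E$/m ? hex($1) : undef;
}

# ordinal -> name: scan the same string for "HEX<TAB>" at line start
sub ord_to_name {
    my ($ord) = @_;
    my $hex = sprintf "%05X", $ord;
    return $names =~ /^\Q$hex\E\t(.*)$/m ? $1 : undef;
}
```

Both lookups walk the string from the start, so the cost grows with the
number of stored names.  An exact-match scan can compare bytes directly,
whereas a loose match also has to disregard spaces and hyphens in the
stored names as it goes.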

Here are the options as I see them:

1) Don't implement loose matching.  This doesn't appeal to me, since my
main goal is to make Perl more Unicode friendly.

2) Accept the slowdown, reasoning that even if it's big, it's 
acceptable.  When names are found, they are currently cached, so the 
performance penalty is incurred only the first time a name is looked up. 
  And likely, most programs aren't going to look up huge numbers of 
names.  But if this later turns out to have been a bad decision, we
can't revert it without breaking backward compatibility.

3) Add a new parameter to the pragma: "use charnames ':loose'", which 
allows the user to consciously select loose matching if they decide the 
slowdown is acceptable.  If we later speed things up, this would become 
a no-op with no backwards compatibility issues.

4) Split the data structures into two: one for each direction of lookup. 
  Each would be loaded only if needed, so there would be no space 
penalty for programs that look up only in one direction.  In fact, the 
string used by \N{} and vianame would be about 6% smaller than 
currently, as it would have spaces and dashes squeezed out.  It is my 
gut feeling that viacode is rarely used anyway.  If we did this, then 
the strings could be converted to very large hashes and performance 
would zoom.  I don't know what the overhead of such a hash would be.
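
Option 4 could look roughly like the following sketch: one hash per
direction, each built only on first use, with the name-to-ordinal hash
keyed on names that have spaces and dashes squeezed out.  All names here
(loose_key, name_to_ord_loose, ord_to_name, the sample records) are
hypothetical, and the squeezing rule is a simplification that removes
every hyphen:

```perl
use strict;
use warnings;

# Small stand-in for the full list of (ordinal, name) records.
my @records = (
    [0x41,   "LATIN CAPITAL LETTER A"],
    [0x61,   "LATIN SMALL LETTER A"],
    [0x200B, "ZERO WIDTH SPACE"],
);

# Simplified loose key: upper-case, then drop spaces, underscores, hyphens.
sub loose_key {
    my ($name) = @_;
    my $key = uc $name;
    $key =~ tr/ _-//d;
    return $key;
}

# One hash per direction; each is filled only when first needed.
my (%ord_for, %name_for);

# name -> ordinal, loosely matched via the squeezed key
sub name_to_ord_loose {
    my ($name) = @_;
    if (!%ord_for) {
        $ord_for{ loose_key($_->[1]) } = $_->[0] for @records;
    }
    return $ord_for{ loose_key($name) };
}

# ordinal -> name, using the second hash
sub ord_to_name {
    my ($ord) = @_;
    if (!%name_for) {
        $name_for{ $_->[0] } = $_->[1] for @records;
    }
    return $name_for{$ord};
}
```

A program that only uses \N{} or vianame() would never populate the
second hash.  The sketch says nothing about the memory overhead of
hashes holding 23K+ entries, which is the open question above.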

I would like to hear what people's opinions are.
