With the latest changes to charnames, the last major bit of Unicode \N{name} functionality that is still missing is loose matching of names. (A sketch of what I mean by "loose matching" is at the end of this message.) I've brought up the issue before, but no conclusions were reached, other than that a trie approach didn't work. Now I have more concrete data: I have implemented loose matching with the current data structures, and it slows down name lookups by a factor of 2-3.

The problem is that there are an awful lot of unpredictable names in Unicode: more than 23K in draft 6.0, up from just under 21K in 5.2. The current implementation places them all in one large string, which is 786K in 6.0. (And this is after the recent patch which removed 1000 predictable names from the string; those are now found algorithmically. There are several hundred thousand more predictable names in Unicode which are also found algorithmically.)

The string is shared by the two inverse functions: 1) looking up the ordinal given a name (\N{} at compile time; vianame() at runtime), and 2) looking up the name given the ordinal (viacode(), runtime only). I presume that's why it is done the way it is: to avoid having two large data structures. The string is searched linearly when looking something up.

Here are what I think the options are:

1) Don't implement loose matching. This doesn't appeal to me, since my main goal is to make Perl more Unicode-friendly.

2) Accept the slowdown, reasoning that even if it's big, it's acceptable. Names are currently cached once found, so the performance penalty is incurred only the first time a name is looked up, and most programs probably aren't going to look up huge numbers of names. But if this later turns out to have been a bad decision, we can't revert it without breaking backward compatibility.

3) Add a new parameter to the pragma, "use charnames ':loose'", which lets the user consciously select loose matching if they decide the slowdown is acceptable. If we later speed things up, this would become a no-op with no backward compatibility issues.

4) Split the data structures into two, one for each direction of lookup. Each would be loaded only if needed, so there would be no space penalty for programs that look up in only one direction. In fact, the string used by \N{} and vianame() would be about 6% smaller than it is now, as it would have spaces and dashes squeezed out. My gut feeling is that viacode() is rarely used anyway. If we did this, then the strings could be converted to very large hashes and performance would zoom; I don't know what the overhead of such a hash would be. (A rough sketch of this is also at the end.)

I would like to hear what people's opinions are.
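
For anyone who hasn't looked at this: below is a minimal sketch of the normalization I have in mind for loose matching, roughly following my reading of the UAX #44 loose-matching rule (fold case; ignore whitespace, underscores, and medial hyphens). It is illustrative only; the real rule has at least one exception (the medial hyphen in HANGUL JUNGSEONG O-E) that I'm glossing over, and the actual code would live in the generated charnames tables, not look like this.

    # Sketch only: squeeze a character name down to a loose-match key.
    # Roughly UAX #44: fold case; drop whitespace, underscores, and
    # medial hyphens.  The HANGUL JUNGSEONG O-E exception is ignored
    # here for brevity.
    sub loose_key {
        my ($name) = @_;

        # Drop hyphens only when they sit between two word characters,
        # so names like TIBETAN LETTER -A stay distinct from
        # TIBETAN LETTER A.
        $name =~ s/(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])//g;

        # Drop whitespace and underscores, then fold case.
        $name =~ s/[\s_]+//g;
        return uc $name;
    }

    # loose_key("Latin Small Letter A") and loose_key("LATIN_SMALL_LETTER_A")
    # both come out as "LATINSMALLLETTERA".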
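
And to make option 4 a bit more concrete, this is roughly the shape I'm imagining for the \N{}/vianame() direction if the string were converted to a hash: keys are the squeezed loose forms from loose_key() above, values are ordinals. The entries here are made up for illustration; the real hash would be generated from the Unicode data files and loaded on demand, and as I said, I haven't measured what its memory overhead would be.

    # Sketch only: a name-to-ordinal table for the \N{}/vianame()
    # direction, keyed on the squeezed loose-match form.  Two
    # hand-picked entries for illustration; the real table would be
    # generated from the UCD and loaded only when needed.
    my %loose_name_to_ord = (
        'LATINSMALLLETTERA'     => 0x0061,
        'GREEKSMALLLETTERALPHA' => 0x03B1,
    );

    sub loose_vianame {
        my ($name) = @_;
        # One hash lookup instead of a linear search of a 786K string.
        return $loose_name_to_ord{ loose_key($name) };
    }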