
Re: RFC: space vs. time vs. functionality in \N{name} loose matching

karl williamson
July 30, 2010 13:18
John Imrie wrote:
> Aren't we overcomplicating this, or have I misunderstood something?
> From
>         Character Names
> Unicode character names constitute a special case. Formally, they are 
> values of the Name property. While each Unicode character name for an 
> assigned character is guaranteed to be unique, names are assigned in 
> such a way that the presence or absence of spaces cannot be used to 
> distinguish them. Furthermore, implementations sometimes create 
> identifiers from Unicode character names by inserting underscores for 
> spaces. For best results in comparing Unicode character names, use loose 
> matching rule UAX44-LM2.
> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all medial 
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>     * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
>       "zerowidthspace"
>     * "character -a" is /not/ equivalent to "character a"
> So the code in mktables needs to create names that have had the spaces, 
> underscores, and medial hyphens removed (except as noted), with the 
> result then uppercased.
> When processing the \N{ whatever } all we have to do is follow the above 
> rules to generate a normalized name.
> I don't know where in the perl C code \N{} is processed but I hope it's 
> not too difficult to process this; certainly it could be written in Perl 
> very easily.
> John

The problem is that we have to look up in both directions.  viacode() 
takes a code point number and returns the official Unicode name.  We 
want that official name to have the correct spaces and hyphens.  We 
don't want it to be "ZEROWIDTHSPACE", for example.  The only reasonable 
way to do this is to have the official name stored correctly.  That 
means we have to have a table with all the correct official names. 
There's no getting around that.

What is done currently is that the same table is used for lookups in 
the other direction, for vianame() or \N{}, which take a name and find 
its code point.  This allows the table to be dual-purposed, but doesn't 
lend itself to loose matching.  Hence there is no loose matching currently.
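
To make the two directions concrete, here is a minimal sketch using the charnames functions named above.  It assumes a Perl whose charnames module provides both viacode() and vianame() and that the names are given in their exact official form; the variable names are only for illustration.

```perl
use strict;
use warnings;
use charnames ();

# Code point -> official name, with its spaces and hyphens intact.
my $name = charnames::viacode(0x200B);

# Official name -> code point (exact spelling assumed here).
my $code = charnames::vianame('ZERO WIDTH SPACE');

printf "U+%04X is %s\n", $code, $name;
```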

Retaining that, i.e., doing nothing, is option 1) in my proposal.

What you suggest is my option 4), which is to create a second table.  It 
would have white space and medial hyphens squeezed out.  Then it becomes 
a simple matter of squeezing the input name similarly and looking for an 
exact match.
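
A rough sketch of what option 4) could look like, under stated assumptions: the four-entry %official hash stands in for the real table that mktables would generate, and squeeze_name() and loose_vianame() are hypothetical names, not anything in the Perl core.  The squeezing follows the UAX44-LM2 wording quoted above, including the U+1180 exception, but it is not a vetted implementation of that rule.

```perl
use strict;
use warnings;

# Tiny stand-in for the official code point => name table.
my %official = (
    0x0031 => 'DIGIT ONE',
    0x200B => 'ZERO WIDTH SPACE',
    0x116C => 'HANGUL JUNGSEONG OE',
    0x1180 => 'HANGUL JUNGSEONG O-E',
);

# Hypothetical helper: uppercase, drop medial hyphens, then drop
# white space and underscores.
sub squeeze_name {
    my ($name) = @_;
    my $s = uc $name;

    # The one name whose hyphen must survive (U+1180).
    (my $probe = $s) =~ s/[\s_]+//g;
    return $probe if $probe eq 'HANGULJUNGSEONGO-E';

    # A medial hyphen has a letter or digit directly on both sides.
    $s =~ s/(?<=[A-Z0-9])-(?=[A-Z0-9])//g;
    $s =~ s/[\s_]+//g;
    return $s;
}

# The second table: squeezed name => code point.
my %by_squeezed;
$by_squeezed{ squeeze_name($official{$_}) } = $_ for keys %official;

# Hypothetical lookup: squeeze the input the same way, then exact match.
sub loose_vianame {
    my ($input) = @_;
    return $by_squeezed{ squeeze_name($input) };
}

printf "U+%04X\n", loose_vianame('zero-width space');
```

Because both the table keys and the input go through the same squeeze_name(), the lookup itself stays a plain hash fetch.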

The downside of this option is that it requires a second huge table.  I 
would change mktables to generate both.  Recall that we have to have a 
table with the correct names in it for the viacode case.  Things could 
be structured so that the corresponding table is loaded only if its 
function is called.  That means that only programs that do lookups in 
both directions would be penalized.  And the lookup-by-name table is 6% 
smaller than the non-squeezed one, so programs that look up only by name 
would gain.
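
The load-on-demand structure could be sketched roughly like this.  Everything here is hypothetical: the loader subs return tiny inline tables instead of reading generated files, and %loads exists only to show which table actually got loaded.

```perl
use strict;
use warnings;

my %loads = (by_code => 0, by_name => 0);   # illustration only
my ($by_code, $by_name);                    # filled on first use

# Stand-in for loading the official-names table.
sub load_by_code {
    $loads{by_code}++;
    my %table = (0x31 => 'DIGIT ONE', 0x200B => 'ZERO WIDTH SPACE');
    return \%table;
}

# Stand-in for loading the squeezed-names table.
sub load_by_name {
    $loads{by_name}++;
    my %table = (DIGITONE => 0x31, ZEROWIDTHSPACE => 0x200B);
    return \%table;
}

sub my_viacode {
    my ($code) = @_;
    $by_code //= load_by_code();    # load only when first needed
    return $by_code->{$code};
}

sub my_vianame {
    my ($squeezed) = @_;
    $by_name //= load_by_name();    # load only when first needed
    return $by_name->{$squeezed};
}

my $name = my_viacode(0x31);
print "$name (by_name table loads so far: $loads{by_name})\n";
```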

Further, once the tables were decoupled, we could do things that would 
speed up performance even more.  The tables could be stored as hashes, 
with some overhead; or as sorted arrays for a binary search, I presume 
with less overhead than the hash case.
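
For the sorted-array alternative, a minimal binary search sketch follows.  The record layout (pairs of squeezed name and code point) and the name find_code() are assumptions made for illustration; the real table format would be whatever mktables ends up generating.

```perl
use strict;
use warnings;

# Pairs of [squeezed name, code point], sorted by name.
my @by_name = sort { $a->[0] cmp $b->[0] } (
    [ 'ZEROWIDTHSPACE',    0x200B ],
    [ 'DIGITONE',          0x31   ],
    [ 'DIGITTWO',          0x32   ],
    [ 'LATINSMALLLETTERA', 0x61   ],
);

# Hypothetical helper: classic binary search on the name field.
sub find_code {
    my ($table, $key) = @_;
    my ($lo, $hi) = (0, $#$table);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $key cmp $table->[$mid][0];
        return $table->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 }
        else          { $lo = $mid + 1 }
    }
    return;    # not found
}

printf "U+%04X\n", find_code(\@by_name, 'DIGITONE');
```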

I am able to do loose matching with just a single table as follows.  The 
table is actually one giant string, and the input is used as a pattern, 
so the code looks like
   $huge_table =~ /$input/;
Then $-[0] and $+[0] are used to find where it matched.
What I can do to get loose matching is to change $input to not be just a 
straight string, but to be a real pattern.  So either 'DIGIT ONE' or
'd i-gito- -ne' input both would be transformed into
  $input = 'D[ -]?I[ -]?G[ -]?I[ -]?T[ -]?O[ -]?N[ -]?E[ -]?'
That is what I've implemented, and it slows lookups down by a factor of 
2-3.  This is option 2) in my proposal.  But remember that the results 
are cached in a hash, so the same lookup later would avoid all this.
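
A self-contained sketch of the single-table approach just described.  The three-line table and its "hex code, tab, name" layout are assumptions for illustration, as are the names loose_pattern() and loose_lookup(); the squeezing here simply drops every space, hyphen and underscore, which is looser than UAX44-LM2.

```perl
use strict;
use warnings;

# Stand-in for the giant string: one "CODE<TAB>NAME" record per line.
my $huge_table = join '', map { "$_\n" }
    "00031\tDIGIT ONE",
    "00032\tDIGIT TWO",
    "0200B\tZERO WIDTH SPACE";

my %cache;    # input as typed => code point (or undef if not found)

# Turn 'DIGIT ONE' or 'd i-gito- -ne' into 'D[ -]?I[ -]?...E[ -]?'.
sub loose_pattern {
    my ($input) = @_;
    (my $squeezed = uc $input) =~ s/[ _-]+//g;
    return join '', map { quotemeta($_) . '[ -]?' } split //, $squeezed;
}

sub loose_lookup {
    my ($input) = @_;
    return $cache{$input} if exists $cache{$input};

    my $pattern = loose_pattern($input);
    my $code;
    if ($huge_table =~ /\t$pattern\n/) {
        # $-[0] is the offset of the tab that starts the match; the
        # code field runs from the start of that line up to the tab.
        my $line_start = rindex($huge_table, "\n", $-[0]) + 1;
        $code = hex substr($huge_table, $line_start, $-[0] - $line_start);
    }
    return $cache{$input} = $code;
}

printf "U+%04X\n", loose_lookup('d i-gito- -ne');
```

The second call with the same input never touches the regex engine, which is the caching effect mentioned above.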

And finally option 3) is option 2) but only if the user included the 
":loose" parameter in the pragma call.

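From the user's side, option 3) would look something like the following sketch.  It assumes a Perl whose charnames pragma accepts a ":loose" parameter (to my knowledge releases from 5.16 on do); at the time of this proposal that parameter is only a suggestion.

```perl
use v5.16;
use strict;
use warnings;
use charnames ':loose';

# Case, extra white space and the medial hyphen are all ignored.
my $alpha = "\N{greek small-letter  ALPHA}";

printf "U+%04X\n", ord $alpha;
```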