Re: RFC: space vs. time vs. functionality in \N{name} loose matching
From: karl williamson
Date: July 30, 2010 13:18
Subject: Re: RFC: space vs. time vs. functionality in \N{name} loose matching
Message ID: 4C533374.3070105@khwilliamson.com
John Imrie wrote:
> Aren't we overcomplicating this, or have I misunderstood something?
>
> From http://www.unicode.org/reports/tr44/#Matching_Rules
>
>
> Character Names
>
> Unicode character names constitute a special case. Formally, they are
> values of the Name property. While each Unicode character name for an
> assigned character is guaranteed to be unique, names are assigned in
> such a way that the presence or absence of spaces cannot be used to
> distinguish them. Furthermore, implementations sometimes create
> identifiers from Unicode character names by inserting underscores for
> spaces. For best results in comparing Unicode character names, use loose
> matching rule UAX44-LM2.
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
> "zerowidthspace"
> * "character -a" is /not/ equivalent to "character a"
>
> So the code in mktables needs to create names that have had the spaces,
> underscores, and medial hyphens removed, except as noted, with the result
> then uppercased.
>
> When processing the \N{ whatever } all we have to do is follow the above
> rules to generate a normalized name.
>
> I don't know where in the perl C code \N{} is processed but I hope it's
> not too difficult to process this; certainly it could be written in Perl
> very easily.
>
> John
The problem is that we have to look up in both directions. viacode()
takes a code point number and returns the official Unicode name. We
want that official name to have the correct spaces and hyphens. We
don't want it to be "ZEROWIDTHSPACE", for example. The only reasonable
way to do this is to have the official name stored correctly. That
means we have to have a table with all the correct official names.
There's no getting around that.
Currently, that same table is also used for lookups in the other
direction, for vianame() or \N{}, which take a name and find its code
point. This lets the table serve both purposes, but doesn't lend itself
to loose matching, so there is no loose matching currently. Retaining
that, i.e., doing nothing, is option 1) in my proposal.
What you suggest is my option 4), which is to create a second table.
It would have white space, underscores, and medial hyphens squeezed out.
Then it becomes a simple matter of squeezing the input name similarly and
looking for an exact match.
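
A minimal sketch of the option 4) idea, not the actual mktables output. The names squeeze_name, loose_vianame, %name_of and %code_of are hypothetical, the four-entry table stands in for the real one, and treating a "medial hyphen" as one with a letter or digit on both sides is an assumption of this sketch:

```perl
use strict;
use warnings;

# Hypothetical sample of official names; the real table comes from mktables.
my %name_of = (
    0x0031 => 'DIGIT ONE',
    0x200B => 'ZERO WIDTH SPACE',
    0x116C => 'HANGUL JUNGSEONG OE',
    0x1180 => 'HANGUL JUNGSEONG O-E',
);

# Reduce a name to a loose-matching key, roughly following UAX44-LM2.
sub squeeze_name {
    my ($name) = @_;
    my $key = uc $name;

    # The one documented exception: keep the hyphen in U+1180's name.
    (my $no_space = $key) =~ s/[\s_]+//g;
    return $no_space if $no_space eq 'HANGULJUNGSEONGO-E';

    # Assumption: a medial hyphen has a letter or digit on both sides.
    $key =~ s/(?<=[A-Z0-9])-(?=[A-Z0-9])//g;
    $key =~ s/[\s_]+//g;    # drop whitespace and underscores
    return $key;
}

# The second, squeezed table: squeezed name => code point.
my %code_of;
$code_of{ squeeze_name($name_of{$_}) } = $_ for keys %name_of;

# Squeeze the input the same way, then look for an exact match.
sub loose_vianame {
    my ($name) = @_;
    return $code_of{ squeeze_name($name) };
}
```

With this, loose_vianame('zero-width space') and loose_vianame('zerowidthspace') find the same entry, while 'character -a' and 'character a' squeeze to different keys because the hyphen after the space is not medial.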
The downside of this option is that it requires a second huge table. I
would change mktables to generate both. Recall that we have to have a
table with the correct names in it for the viacode case. Things could
be structured so that each table is loaded only if its corresponding
function is called. That means that only programs that do lookups in
both directions would be penalized. And the lookup-by-name table is 6%
smaller than the non-squeezed one, so programs that look up only by name
would gain.
Further, once the tables were decoupled, we could do things that would
improve performance even more. The tables could be stored as hashes,
with some overhead; or as sorted arrays for a binary search, I presume
with less overhead than the hash case.
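
For the sorted-array alternative, a rough sketch could look like the following. The parallel arrays @names and @codes and the function find_code are hypothetical, and the three entries are only sample data; nothing here reflects how the real tables would be laid out:

```perl
use strict;
use warnings;

# Hypothetical squeezed names, kept in sorted order, with a parallel
# array of code points at the same indices.
my @names = ('DIGITONE', 'DIGITZERO', 'ZEROWIDTHSPACE');
my @codes = (0x31,       0x30,        0x200B);

# Plain binary search over the sorted names.
sub find_code {
    my ($key) = @_;
    my ($lo, $hi) = (0, $#names);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $key cmp $names[$mid];
        return $codes[$mid] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 }
        else          { $lo = $mid + 1 }
    }
    return;    # not found
}
```

The hash version is a one-line lookup; this one trades a few comparisons per lookup for not having to build the hash.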
Here is how I am able to do loose matching with just a single table.
The table is actually a giant string; the input is treated as a pattern,
and the code looks like

    $huge_table =~ /$input/;

Then $-[0] and $+[0] are used to find where it matched.
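
To make the mechanics concrete, here is a small sketch of an exact-match lookup against a string table. The "HEX<tab>NAME" line layout, the three sample entries, and the function code_for_name are assumptions for illustration, not the actual table format:

```perl
use strict;
use warnings;

# Hypothetical layout: one "HEXCODE\tNAME\n" line per character.
my $huge_table = join '',
    "00030\tDIGIT ZERO\n",
    "00031\tDIGIT ONE\n",
    "0200B\tZERO WIDTH SPACE\n";

sub code_for_name {
    my ($name) = @_;
    my $input = quotemeta $name;    # exact match: no metacharacters

    # Match the whole name field: from the tab to the end of the line.
    return unless $huge_table =~ /\t$input$/m;

    # $-[0] is the offset where the match began (the tab) and $+[0]
    # is the offset just past the end of the name.
    my $line_start = rindex($huge_table, "\n", $-[0]) + 1;
    my $hex = substr($huge_table, $line_start, $-[0] - $line_start);
    return hex $hex;
}
```

Here code_for_name('DIGIT ONE') locates the tab before the name, walks back to the start of that line, and converts the hex field in between.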
What I can do to get loose matching is to change $input from a plain
string into a real pattern. So input of either 'DIGIT ONE' or
'd i-gito- -ne' would be transformed into
    $input = 'D[ -]?I[ -]?G[ -]?I[ -]?T[ -]?O[ -]?N[ -]?E[ -]?'
That is what I've implemented, and it slows lookups down by a factor of
2 to 3. This is option 2) in my proposal. But remember that the results
are cached in a hash, so the same lookup later would avoid all this.
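
A sketch of that transformation together with the cache, under the same assumed "HEX<tab>NAME" table layout; loose_pattern, loose_code_for_name and %cache are hypothetical names, not the ones in the actual implementation:

```perl
use strict;
use warnings;

# Hypothetical table layout: one "HEXCODE\tNAME\n" line per character.
my $huge_table = join '',
    "00030\tDIGIT ZERO\n",
    "00031\tDIGIT ONE\n",
    "0200B\tZERO WIDTH SPACE\n";

# Uppercase the input, drop its spaces and hyphens, then allow an
# optional space or hyphen after every remaining character.
sub loose_pattern {
    my ($name) = @_;
    (my $squeezed = uc $name) =~ s/[ -]+//g;
    return join '', map { quotemeta($_) . '[ -]?' } split //, $squeezed;
}

my %cache;    # input name => code point (or undef if not found)

sub loose_code_for_name {
    my ($name) = @_;
    return $cache{$name} if exists $cache{$name};

    my $input = loose_pattern($name);
    my $code;
    if ($huge_table =~ /^([0-9A-F]+)\t$input$/m) {
        $code = hex $1;
    }
    return $cache{$name} = $code;
}
```

Both 'DIGIT ONE' and 'd i-gito- -ne' produce the same pattern and find U+0031; a repeated lookup of the same input is answered from %cache without running the regex again.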
And finally option 3) is option 2) but only if the user included the
":loose" parameter in the pragma call.