Re: RFC: space vs. time vs. functionality in \N{name} loose matching

From: John Imrie
Date: July 30, 2010 12:29
Subject: Re: RFC: space vs. time vs. functionality in \N{name} loose matching
Message ID: 4C5327AD.2040901@virginmedia.com

Aren't we overcomplicating this, or have I misunderstood something?

From http://www.unicode.org/reports/tr44/#Matching_Rules


        Character Names

Unicode character names constitute a special case. Formally, they are
values of the Name property. While each Unicode character name for an
assigned character is guaranteed to be unique, names are assigned in
such a way that the presence or absence of spaces cannot be used to
distinguish them. Furthermore, implementations sometimes create
identifiers from Unicode character names by inserting underscores for
spaces. For best results in comparing Unicode character names, use loose
matching rule UAX44-LM2.

/*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all medial
hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

    * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
      "zerowidthspace"
    * "character -a" is /not/ equivalent to "character a"

So the code in mktables needs to create names that have had the spaces,
underscores, and medial hyphens removed (except as noted), with the
result then uppercased.

When processing the \N{ whatever }, all we have to do is follow the above
rules to generate a normalized name and look that up among the names
mktables generated.
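
A minimal sketch of that normalization in Perl, to make the rule
concrete. The function name loose_name is hypothetical and is not taken
from mktables or the perl core. It also assumes that a "medial" hyphen
means one with a letter or digit immediately on both sides, and it
handles the U+1180 exception by checking for that one name explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a character name to a UAX44-LM2 style key.
sub loose_name {
    my ($name) = @_;
    my $n = uc $name;    # ignore case

    # The one exception: keep the hyphen in HANGUL JUNGSEONG O-E (U+1180),
    # so that it stays distinct from HANGUL JUNGSEONG OE.
    return 'HANGULJUNGSEONGO-E'
        if $n =~ /^[\s_]*HANGUL[\s_]*JUNGSEONG[\s_]*O-E[\s_]*$/;

    # Drop medial hyphens. Assumption: "medial" means a letter or digit
    # sits directly on both sides, so the hyphen in "CHARACTER -A" stays.
    $n =~ s/(?<=[A-Z0-9])-(?=[A-Z0-9])//g;

    # Drop whitespace and underscores.
    $n =~ s/[\s_]+//g;

    return $n;
}
```

With this sketch, "zero-width space", "ZERO WIDTH SPACE" and
"zerowidthspace" all reduce to the same key, while "character -a" and
"character a" reduce to different keys, matching the two examples in the
quoted rule.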

I don't know where in the perl C code \N{} is processed, but I hope it's
not too difficult to implement this there; it certainly could be written
in Perl very easily.

John
