RFC: what unicode mapping properties to expose

From: karl williamson
Date: September 14, 2009 07:51
Subject: RFC: what unicode mapping properties to expose
Message ID: 4AAE5871.2070604@khwilliamson.com

I'm working on revising mktables.  So far, the new version works on all 
Unicode releases and all character properties.

mktables generates two types of tables for properties, which I call 
matching tables and mapping tables.  It outputs files for these tables 
for later use by perl.  Matching tables are used in regular expression 
property matching (\p{...}).  These are stored in subdirectories of 
unicore/lib.  Mapping tables are used by things like uc(), and by various 
packages like Unicode::Normalize and charnames.  They are stored either 
in unicore/To or directly in unicore.

We've discussed on this list before which matching tables to not expose, 
and my position is that we should expose all non-obsolete ones that 
Unicode doesn't say shouldn't be used.  Sorry for the double negative. 
There are certain properties that Unicode publishes but says shouldn't 
be used, because they are helper properties that Unicode itself uses in 
constructing other properties that they do want us to use.  I'm not 
planning to expose those.

But my question this time is, what mapping tables should be exposed? 
Changing the decision is simply a matter of adding or deleting the 
property name from a list in mktables.

A Unicode character property actually is a mapping from each possible 
code point to a characteristic (or property) of that code point.  For 
example, the lc mapping of 'A' is 'a', and the script mapping of 'A' is 
Latin, while its block mapping is ASCII (or its synonym, Basic Latin). 
They are essentially functions on code points: uc('a') = 'A'.  A binary 
property in Unicode is a mapping of all code points to either Y or N.  A 
mapping table lists all the code points in ordinal order and what they 
map to.
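
To make the idea of a mapping table concrete, here is a minimal sketch 
(in Python, purely for illustration; it is not how mktables works).  It 
walks a range of code points in ordinal order and merges runs that map 
to the same value into rows.  General_Category, as reported by Python's 
unicodedata module, stands in for "some property"; the function name 
mapping_table and the (start, end, value) row layout are assumptions of 
this sketch, not the format of the files under unicore.

```python
import unicodedata

def mapping_table(prop, first=0x0000, last=0x007F):
    """Return [(start, end, value), ...] rows in code point order.

    Adjacent code points that map to the same value are merged into a
    single range.  `prop` is any function from a 1-char string to a value.
    """
    rows = []
    for cp in range(first, last + 1):
        value = prop(chr(cp))
        if rows and rows[-1][2] == value and rows[-1][1] == cp - 1:
            rows[-1] = (rows[-1][0], cp, value)   # extend the current range
        else:
            rows.append((cp, cp, value))          # start a new range
    return rows

# General_Category is only a stand-in for "some Unicode property".
gc_table = mapping_table(unicodedata.category)

for start, end, value in gc_table:
    print(f"{start:04X}\t{end:04X}\t{value}")
```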

The revised mktables constructs mapping tables for every property, and 
then uses those to construct the corresponding matching-type tables.  A 
matching table is merely a list of all the code points that map to a 
particular value.  The table for \p{Script=Greek}, for example, lists all 
characters whose script mapping is Greek.  In perl, \p{IsUppercase} is 
actually a synonym for what Unicode would write as \p{Uppercase=Y}, that 
is, all characters whose 'Uppercase' mapping is Y, which means all those 
that are uppercase letters.
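
The relationship between the two kinds of table can be sketched as 
follows (Python, illustration only).  Given a property as a function on 
code points, the matching table for each value is just the list of code 
points that map to it.  The helper names are hypothetical, and the 
binary "Uppercase" property is approximated here by General_Category 
being Lu, which is narrower than the real Unicode Uppercase property.

```python
import unicodedata
from collections import defaultdict

def matching_tables(prop, first=0x0000, last=0x00FF):
    """Group code points by property value: {value: [code points]}."""
    tables = defaultdict(list)
    for cp in range(first, last + 1):
        tables[prop(chr(cp))].append(cp)
    return dict(tables)

def uppercase_yn(ch):
    """A binary property: every code point maps to either Y or N.

    Assumption: General_Category == Lu is used as a rough stand-in
    for the Unicode Uppercase property.
    """
    return "Y" if unicodedata.category(ch) == "Lu" else "N"

tables = matching_tables(uppercase_yn)
upper = tables["Y"]   # plays the role of the table behind \p{Uppercase=Y}

print([f"{cp:04X}" for cp in upper[:5]])
```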

So, the mapping tables are constructed, but currently very few of them 
are actually written into files, only the ones that some other piece of 
code actually uses.  Now, if more of these files had been around 
earlier, it would have been somewhat easier to write various modules 
like those in Unicode::.  Currently, there is no way to access any of 
these tables directly through perl.  Someone has to write code that 
explicitly reads them in, like ucfirst() and cousins do.

One could take the stance that "if you build it, they will come", and 
expose all the tables so that they are ready-made for future use.  But I 
don't actually think it makes sense to expose the binary property 
mapping tables, because it's just as easy to use \p{...} to get full 
information from them.  And is anyone ever going to want to use the 
"simple" casing tables, when perl already uses the "full" ones, which 
give more accurate results?  Nor can I see anyone using the ISO_Comment 
table, from which Unicode is removing all data in 5.2 anyway.
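
For readers unfamiliar with the simple/full distinction, here is a small 
illustration (in Python, whose str.upper() applies the full case 
mappings).  A full mapping may turn one character into several; a simple 
mapping is strictly one-to-one, so a character such as U+00DF, which has 
no single-character uppercase, simply maps to itself there.  The helper 
name is made up for this sketch.

```python
def has_multichar_upper(cp):
    """True if the full uppercase mapping of `cp` is longer than one char."""
    return len(chr(cp).upper()) > 1

# Code points below U+0250 whose full uppercase expands to several
# characters; a simple (one-to-one) casing table cannot express these.
expanding = [cp for cp in range(0x0250) if has_multichar_upper(cp)]

for cp in expanding:
    print(f"U+{cp:04X} {chr(cp)!r} -> {chr(cp).upper()!r}")
```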

My current proposal is to newly expose just those properties that aren't 
otherwise accessible through perl and aren't shadows of more complete 
properties (like simple vs. full casing), excluding the one that is 
going away in 5.2.  In 5.1, this leaves just a couple of properties, 
dealing with normalization and mirroring, that someone might want to use 
sometime.

On the other hand, for debugging purposes, I'm writing out all the 
non-binary tables, and I've found it convenient to be able to quickly 
eyeball what script, for example, a character is in.  That information 
is available, inconveniently, through \p{...}, and so isn't in my 
proposal above.  I could easily continue to write out these properties 
as well.

Also, Unicode suggests rather fancier property matching expressions than 
perl supports.  Here are some examples from their TR18, using POSIX 
notation:
[[:name=/CJK/:]-[:ideographic:]]
    The set of all characters with names that contain CJK that are
    not Ideographic
[:name=/\bDOT$/:]
    The set of all characters with names that end with the word DOT
[:block=/(?i)arab/:]
    The set of all characters in blocks that contain the sequence of
    letters "arab" (case-insensitive)
[:toNFKC=/\./:]
    The set of all characters with toNFKC values that contain a
    literal period

The last is using a mapping table (because of the 'to').
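
To show what such expressions amount to, here is a rough emulation (in 
Python, illustration only) that brute-forces three of them over a 
limited range of code points.  The helper names, the upper bound of 
U+4E10, and the use of General_Category Lo as a crude stand-in for the 
Ideographic property are all assumptions of this sketch; a real 
implementation would read the name, property, and toNFKC tables instead.

```python
import re
import unicodedata

def chars_where(predicate, last=0x4E10):
    """Return the set of code points up to `last` satisfying `predicate`."""
    return {cp for cp in range(last + 1) if predicate(chr(cp))}

def name_matches(pattern):
    rx = re.compile(pattern)
    # Unnamed code points are treated as having an empty name.
    return lambda ch: bool(rx.search(unicodedata.name(ch, "")))

def nfkc_matches(pattern):
    rx = re.compile(pattern)
    # The NFKC form of a single character plays the role of its toNFKC value.
    return lambda ch: bool(rx.search(unicodedata.normalize("NFKC", ch)))

# [:name=/\bDOT$/:]
dot_names = chars_where(name_matches(r"\bDOT$"))

# [:toNFKC=/\./:]
nfkc_period = chars_where(nfkc_matches(r"\."))

# [[:name=/CJK/:]-[:ideographic:]], with Lo approximating Ideographic
cjk_names = chars_where(name_matches(r"CJK"))
roughly_ideographic = chars_where(lambda ch: unicodedata.category(ch) == "Lo")
cjk_not_ideographic = cjk_names - roughly_ideographic

print(len(dot_names), len(nfkc_period), len(cjk_not_ideographic))
```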

If we were to ever expand to support things like this at some point in 
the future, we would need these mapping tables.  But then we could 
simply change mktables to write them at that time.

Any comments?
