RFC: what unicode mapping properties to expose
From:
karl williamson
Date:
September 14, 2009 07:51
Subject:
RFC: what unicode mapping properties to expose
Message ID:
4AAE5871.2070604@khwilliamson.com
I'm working on revising mktables. The new version works on all Unicode
releases and all character properties (so far).
mktables generates two types of tables for properties, which I call
matching tables and mapping tables. It outputs files for these tables
for later use by perl. Matching tables are used in regular expression
property matching (\p{...}). These are stored in subdirectories of
unicore/lib. Mapping tables are used by things like uc(), and various
packages like Unicode::Normalize, and charnames. They are stored either
in unicore/To, or just unicore.
We've discussed on this list before which matching tables to not expose,
and my position is that we should expose all non-obsolete ones that
Unicode doesn't say shouldn't be used. Sorry for the double negative.
There are certain properties that Unicode publishes, but says they
shouldn't be used because they are helper properties for use by Unicode
in constructing other properties that they do want us to use. I'm not
planning to expose those.
But my question this time is, what mapping tables should be exposed?
Changing the decision is simply a matter of adding or deleting the
property name from a list in mktables.
A Unicode character property actually is a mapping from each possible
code point to a characteristic (or property) of that code point. For
example, the lc mapping of 'A' is 'a', and the script mapping of 'A' is
Latin, while its block mapping is ASCII (or its synonym, Basic Latin).
They are essentially functions on code points: uc('a') = 'A'. A binary
property in Unicode is a mapping of all code points to either Y or N. A
mapping table lists all the code points in ordinal order and what they
map to.
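
As a rough illustration of the "property as a function on code points" view, the sketch below asks perl for a few mappings of 'A' using only lc() and \p{...}. The hash and its key names are made up for the example; they are not anything mktables produces.

```perl
use strict;
use warnings;

# Each property behaves like a function from a code point to a value.
my $char = 'A';

# Hypothetical summary hash; the key names are illustrative only.
my %mapping = (
    lower       => lc($char),                                        # case mapping
    latin       => ($char =~ /\A\p{Script=Latin}\z/      ? 'Y' : 'N'),
    basic_latin => ($char =~ /\A\p{Block=Basic_Latin}\z/ ? 'Y' : 'N'),
    uppercase   => ($char =~ /\A\p{Uppercase}\z/         ? 'Y' : 'N'), # binary property
);

printf "%-12s => %s\n", $_, $mapping{$_} for sort keys %mapping;
```

Note that the non-binary properties can only be probed one candidate value at a time this way; a mapping table would give the value directly.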
The revised mktables constructs mapping tables for every property, and
then uses those to construct the corresponding matching-type tables. A
matching table is merely a list of all the code points that map to a
particular value. The table for \p{Script=Greek}, for example, lists all
characters whose script mapping is Greek. In perl, \p{IsUppercase} is
actually a synonym for what Unicode would write as \p{Uppercase=Y}, that
is all characters whose 'Uppercase' mapping is Y, which means all those
that are uppercase letters.
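
The relationship between the two kinds of table can be sketched as inverting a hash. This is only a toy model with a handful of hand-picked code points, not the data structures mktables actually uses; %script_of and invert_mapping() are names invented for the example.

```perl
use strict;
use warnings;

# Toy mapping table: code point => script (a tiny hand-picked sample).
my %script_of = (
    0x0041 => 'Latin',
    0x0042 => 'Latin',
    0x0391 => 'Greek',
    0x03B1 => 'Greek',
    0x0410 => 'Cyrillic',
);

# Build one matching table per value: value => sorted list of code points.
sub invert_mapping {
    my ($map) = @_;
    my %match;
    for my $cp (sort { $a <=> $b } keys %$map) {
        push @{ $match{ $map->{$cp} } }, $cp;
    }
    return \%match;
}

my $matching = invert_mapping(\%script_of);
for my $script (sort keys %$matching) {
    printf "%-8s: %s\n", $script,
        join ' ', map { sprintf 'U+%04X', $_ } @{ $matching->{$script} };
}
```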
So, the mapping tables are constructed, but currently very few of them
are actually written into files, only the ones that some other piece of
code actually uses. Now, if more of these files had been around
earlier, it would have been somewhat easier to write various modules
like those in Unicode::. Currently, there is no way to access any of
these tables directly through perl. Someone has to write code that
explicitly reads them in, like ucfirst() and cousins do.
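
To give a feel for what "explicitly reads them in" involves, here is a minimal sketch of a reader for a range-style table. The line format (hex start, optional hex end, value, separated by tabs) is an assumption made for the example, and the sample rows are illustrative; check the real generated files before relying on it. parse_range_table() and lookup() are hypothetical helpers.

```perl
use strict;
use warnings;

# ASSUMED format: "start<TAB>end<TAB>value", hex code points, empty end
# meaning a single code point. The rows below are sample data only.
my $text = join "\n", map { join "\t", @$_ } (
    [ '0041', '005A', 'Latin' ],
    [ '00AA', '',     'Latin' ],
    [ '0391', '03A1', 'Greek' ],
    [ '03B1', '03C9', 'Greek' ],
);

# Turn the text into a list of [start, end, value] triples.
sub parse_range_table {
    my ($table) = @_;
    my @ranges;
    for my $line (split /\n/, $table) {
        next if $line =~ /^\s*(?:#|$)/;          # skip comments and blanks
        my ($start, $end, $value) = split /\t/, $line;
        $end = $start unless defined $end && length $end;
        push @ranges, [ hex $start, hex $end, $value ];
    }
    return \@ranges;
}

# Linear search is enough for a sketch; a real reader would bisect.
sub lookup {
    my ($ranges, $cp) = @_;
    for my $r (@$ranges) {
        return $r->[2] if $cp >= $r->[0] && $cp <= $r->[1];
    }
    return;
}

my $ranges = parse_range_table($text);
printf "U+%04X => %s\n", $_, lookup($ranges, $_) // '(none)'
    for 0x41, 0xAA, 0x3B1, 0x30;
```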
One could take the stance that "if you build it, they will come", and
expose all the tables so that they are ready-made for future use. But I
don't actually think it makes sense to expose the binary property
mapping tables, because it's just as easy to use the \p{...} to get full
information from them. And is anyone ever going to want to use the
"simple" casing tables, when perl already uses the "full" ones, which
give more accurate results? And I can't see anyone using the
ISO_Comment table, which Unicode is removing all data from anyway in 5.2.
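
A quick way to see the difference between simple and full casing is a character whose full uppercase mapping is longer than one character. The sketch below uses U+FB00 (LATIN SMALL LIGATURE FF), which has no single-character uppercase form; the choice of character is mine, purely for illustration.

```perl
use strict;
use warnings;

my $lig   = "\x{FB00}";    # LATIN SMALL LIGATURE FF
my $upper = uc $lig;       # full casing expands it to two characters

printf "length before: %d, after: %d\n", length $lig, length $upper;
print  "uc gives 'FF'\n" if $upper eq 'FF';
```

A simple (one-to-one) case mapping cannot express this expansion, which is why the simple tables are the less accurate shadow of the full ones.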
My current proposal is to newly expose just those properties that aren't
otherwise accessible through perl and not shadows of more complete
properties (like the simple vs full), and not the one that is going away
in 5.2. In 5.1 this results in just a couple of tables, dealing with
normalization and mirroring, that someone might want to use sometime.
On the other hand, for debugging purposes, I'm writing out all the
non-binary tables, and I've found it convenient to be able to quickly
eyeball what script, for example, a character is in. That information
is available, inconveniently, through \p{...}, and so isn't in my
proposal above. I could easily continue to write out these properties
as well.
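
To show why I call \p{...} inconvenient for this, here is a sketch of finding a character's script by trying candidate scripts one at a time. The list of candidates is deliberately tiny and script_of() is a made-up helper; a real version would need every script name.

```perl
use strict;
use warnings;

# One precompiled pattern per candidate script (a small sample only).
my %script_re = (
    Common   => qr/\A\p{Script=Common}\z/,
    Cyrillic => qr/\A\p{Script=Cyrillic}\z/,
    Greek    => qr/\A\p{Script=Greek}\z/,
    Latin    => qr/\A\p{Script=Latin}\z/,
);

# Try each candidate until one matches; empty return if none does.
sub script_of {
    my ($char) = @_;
    for my $name (sort keys %script_re) {
        return $name if $char =~ $script_re{$name};
    }
    return;
}

printf "U+%04X is %s\n", ord $_, script_of($_) // 'unknown'
    for 'A', "\x{3B1}", "\x{416}", '1';
```

With a written-out mapping table the same answer is a single lookup, or a glance at the file.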
Also, Unicode suggests rather fancier property matching expressions than
perl supports. Here are some examples from their TR18, using POSIX
notation:
[[:name=/CJK/:]-[:ideographic:]]  The set of all characters with
                                  names that contain CJK that are
                                  not Ideographic
[:name=/\bDOT$/:]                 The set of all characters with
                                  names that end with the word DOT
[:block=/(?i)arab/:]              The set of all characters in
                                  blocks that contain the sequence
                                  of letters "arab"
                                  (case-insensitive)
[:toNFKC=/\./:]                   The set of all characters
                                  with toNFKC values that contain a
                                  literal period
The last is using a mapping table (because of the 'to').
If we were to ever expand to support things like this at some point in
the future, we would need these mapping tables. But then we could
simply change mktables to write them at that time.
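
For what it's worth, two of those expressions can be emulated today by brute force over a range of code points, using charnames::viacode() and Unicode::Normalize::NFKC() as stand-ins for the name and toNFKC mappings. This is only a sketch: code_points_where() is a hypothetical helper, and the ranges are kept small on purpose because scanning every code point this way is slow.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: collect the code points in [$lo, $hi] whose mapped
# value (as computed by the $map callback) matches the pattern $re.
sub code_points_where {
    my ($map, $re, $lo, $hi) = @_;
    my @hits;
    for my $cp ($lo .. $hi) {
        my $value = $map->($cp);
        push @hits, $cp if defined $value && $value =~ $re;
    }
    return @hits;
}

# Roughly [:name=/\bDOT$/:], restricted to U+0000..U+00FF.
my @dot_named = code_points_where(
    sub { charnames::viacode($_[0]) }, qr/\bDOT$/, 0x00, 0xFF);

# Roughly [:toNFKC=/\./:], restricted to U+2000..U+206F.
my @nfkc_period = code_points_where(
    sub { NFKC(chr $_[0]) }, qr/\./, 0x2000, 0x206F);

printf "name ends in DOT:     U+%04X\n", $_ for @dot_named;
printf "NFKC contains period: U+%04X\n", $_ for @nfkc_period;
```

Real support would want precomputed mapping tables rather than a callback per code point, which is the case for writing them out.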
Any comments?