RFC: Unicode::UCD::prop_invlist()
From: Karl Williamson
Date: July 10, 2011 21:20
Subject: RFC: Unicode::UCD::prop_invlist()
Message ID: 4E1A79E1.7050006@khwilliamson.com
I've put this out on the perl Unicode list for comment. It responds to
a need that a user raised on that list. Currently, some information
from Perl's Unicode database is not easily accessible without reading
the files directly. We don't want to be held to keeping those files in
the same format forever, and the files caution about that at the
beginning. Nonetheless, it is better to give people a reasonable API
than to expect them not to use the forbidden files, which hold data
that is otherwise not accessible.
prop_invlist
"prop_invlist" returns an inversion list (see below) that defines all
the code points for the Unicode property given by the input parameter
string:
    say join ", ", prop_invlist("Any");
    0, 1114112
An empty list is returned if the given property is unknown.
perluniprops gives the list of properties that this function accepts, as
well as all the possible forms for them. Note that many properties can
be specified in a compound form, such as:
    say join ", ", prop_invlist("Script=Shavian");
    66640, 66688
    say join ", ", prop_invlist("ASCII_Hex_Digit=No");
    0, 48, 58, 65, 71, 97, 103
    say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
    48, 58, 65, 71, 97, 103
Inversion lists are a compact way of specifying Unicode properties. The
0th item in the list is the lowest code point that has the
property-value. The next item is the lowest code point after that one
that does NOT have the property-value. And the next item after that is
the lowest code point after that one that has the property-value, and so
on. Put another way, each element in the list gives the beginning of a
range that has the property-value (for even-numbered elements, counting
from 0), or doesn't have the property-value (for odd-numbered elements).
Each range runs up to, but does not include, the next element in the
list.
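Because the elements are sorted range boundaries, testing whether a single code point has the property-value does not require expanding the list. The following is a minimal sketch, assuming a helper named in_invlist (hypothetical, not part of Unicode::UCD) and using the ASCII_Hex_Digit=Yes list from the example output above as hard-coded data:

```perl
use strict;
use warnings;

# Inversion list for ASCII_Hex_Digit=Yes, copied from the example output.
my @invlist = (48, 58, 65, 71, 97, 103);

# Hypothetical helper (not part of Unicode::UCD): returns true if $cp
# falls in a range that has the property-value.
sub in_invlist {
    my ($cp, $list) = @_;

    # Binary search for the number of elements that are <= $cp.
    my ($lo, $hi) = (0, scalar @$list);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($list->[$mid] <= $cp) { $lo = $mid + 1 }
        else                      { $hi = $mid }
    }

    # The range containing $cp starts at index $lo - 1.  Even indices
    # start ranges that have the property-value, so $cp matches exactly
    # when the count $lo is odd.
    return $lo % 2 == 1;
}

for my $char ("7", "G", "f") {
    printf "%s: %s\n", $char,
        in_invlist(ord($char), \@invlist) ? "yes" : "no";
}
# prints:
#   7: yes
#   G: no
#   f: yes
```

The lookup takes a number of steps proportional to the logarithm of the list length, which is one reason inversion lists suit large properties.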
In the final example above, the first ASCII Hex digit is code point 48,
the character "0", and all code points from it through 57 (a "9") are
ASCII hex digits. Code points 58 through 64 aren't, but 65 (an "A")
through 70 (an "F") are, as are 97 ("a") through 102 ("f"). 103 starts a
range of code points that aren't ASCII hex digits. That range extends to
infinity, which on your computer can be found in the variable
$Unicode::UCD::MAX_CP.
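The ASCII_Hex_Digit=No and ASCII_Hex_Digit=Yes outputs shown earlier differ only by a leading 0, which illustrates a general property of inversion lists: toggling a leading 0 swaps the "has" and "doesn't have" ranges. A minimal sketch of that idea, assuming a helper named invert_invlist (hypothetical, not part of the proposed API):

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::UCD): complement an
# inversion list by toggling a leading 0.
sub invert_invlist {
    my @list = @_;
    if (@list && $list[0] == 0) {
        shift @list;         # code point 0 was in the set; remove it
    }
    else {
        unshift @list, 0;    # code point 0 was not in the set; add it
    }
    return @list;
}

# ASCII_Hex_Digit=Yes, copied from the example output.
my @yes = (48, 58, 65, 71, 97, 103);
my @no  = invert_invlist(@yes);

print join(", ", @no), "\n";    # prints 0, 48, 58, 65, 71, 97, 103
```

Note that under this sketch an empty list inverts to (0), meaning every code point. Since prop_invlist also returns an empty list for an unknown property, check for that case before inverting.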
It is a simple matter to expand out an inversion list to a full list of
all code points that have the property-value:
    use Unicode::UCD "prop_invlist";

    my @invlist = prop_invlist("My Property");
    die "empty" unless @invlist;

    my @full_list;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $upper = ($i + 1) < @invlist
                    ? $invlist[$i+1] - 1      # In range
                    : $Unicode::UCD::MAX_CP;  # To infinity.  You may want
                                              # to stop much much earlier;
                                              # going this high may expose
                                              # perl bugs with very large
                                              # numbers.
        for my $j ($invlist[$i] .. $upper) {
            push @full_list, $j;
        }
    }
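When a full expansion would be too large, the same loop can collect inclusive [first, last] pairs instead. This is a sketch only: invlist_to_ranges is a hypothetical helper, and the default cap of 0x10FFFF (the highest Unicode code point) for an open-ended final range is an assumption made here in place of $Unicode::UCD::MAX_CP.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::UCD): turn an inversion list
# into inclusive [first, last] pairs.  $max caps an open-ended final
# range; the 0x10FFFF default is an assumption of this sketch.
sub invlist_to_ranges {
    my ($list, $max) = @_;
    $max = 0x10FFFF unless defined $max;

    my @ranges;
    for (my $i = 0; $i < @$list; $i += 2) {
        my $last = ($i + 1) < @$list
                   ? $list->[$i + 1] - 1    # Range closed by next element
                   : $max;                  # Open-ended final range
        push @ranges, [ $list->[$i], $last ];
    }
    return @ranges;
}

# ASCII_Hex_Digit=Yes, copied from the example output.
my @hex = (48, 58, 65, 71, 97, 103);
print join(", ", map { "$_->[0]-$_->[1]" } invlist_to_ranges(\@hex)), "\n";
# prints 48-57, 65-70, 97-102
```

For a list with an odd number of elements, such as the ASCII_Hex_Digit=No output, the last pair ends at the cap rather than at a boundary taken from the list.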