
RFC: Unicode::UCD::prop_invlist()

From: Karl Williamson
Date: July 10, 2011 21:20
Subject: RFC: Unicode::UCD::prop_invlist()
Message ID: 4E1A79E1.7050006@khwilliamson.com

I've put this out on the perl Unicode list for comment.  It responds to 
a need that a user raised on that list.  Currently, some information 
from Perl's Unicode database is not easily accessible without reading 
the files directly.  We don't want to be held to keeping those files in 
the same format forever, and the files caution about that at the 
beginning.  Nonetheless, it is better to give people a reasonable API 
than to expect them to stay away from the forbidden files when those 
files hold data that is otherwise not accessible.

prop_invlist

"prop_invlist" returns an inversion list (see below) that defines all 
the code points for the Unicode property given by the input parameter 
string:

  say join ", ", prop_invlist("Any");
  0, 1114112

An empty list is returned if the given property is unknown.

perluniprops gives the list of properties that this function accepts, as 
well as all the possible forms for them. Note that many properties can 
be specified in a compound form, such as

  say join ", ", prop_invlist("Script=Shavian");
  66640, 66688

  say join ", ", prop_invlist("ASCII_Hex_Digit=No");
  0, 48, 58, 65, 71, 97, 103

  say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
  48, 58, 65, 71, 97, 103
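
The last two outputs differ only in the leading 0.  That follows from 
the inversion list format described below: every element flips between 
"has the property-value" and "doesn't", so adding or removing a leading 
0 flips the whole list.  A minimal sketch, with the lists copied from 
the outputs above and a helper name (invlist_complement) that is made up 
for illustration and is not part of Unicode::UCD:

```perl
use strict;
use warnings;

# Lists copied from the example outputs above, not fetched from
# prop_invlist().
my @yes = (48, 58, 65, 71, 97, 103);
my @no  = (0, 48, 58, 65, 71, 97, 103);

# Hypothetical helper, not part of Unicode::UCD.
# Drop a leading 0 if there is one, otherwise prepend a 0.
sub invlist_complement {
    my @list = @_;
    if (@list && $list[0] == 0) { shift @list }
    else                        { unshift @list, 0 }
    return @list;
}

print join(", ", invlist_complement(@yes)), "\n";   # same as @no
print join(", ", invlist_complement(@no)),  "\n";   # same as @yes
```

This should not be applied to the empty list that prop_invlist returns 
for an unknown property, since that empty list means "unknown" rather 
than "matches nothing".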

Inversion lists are a compact way of specifying Unicode properties. The 
0th item in the list is the lowest code point that has the 
property-value. The next item is the lowest code point after that one 
that does NOT have the property-value. And the next item after that is 
the lowest code point after that one that has the property-value, and so 
on. Put another way, each element in the list gives the beginning of a 
range that has the property-value (for even-numbered elements, counting 
from 0), or doesn't have the property-value (for odd-numbered elements).
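
One consequence of this layout is that a membership test needs only a 
count: a code point has the property-value exactly when an odd number of 
list elements are less than or equal to it.  Here is a rough sketch of 
that idea using a binary search.  The helper name in_invlist is invented 
for this example (it is not part of Unicode::UCD), and the list is the 
ASCII_Hex_Digit=Yes output shown earlier, hard-coded so the snippet runs 
on its own:

```perl
use strict;
use warnings;

# Inversion list for ASCII_Hex_Digit=Yes, copied from the example output.
my @invlist = (48, 58, 65, 71, 97, 103);

# Hypothetical helper, not part of Unicode::UCD.
# Binary-search for the number of elements <= $cp.  The code point has
# the property-value exactly when that count is odd.
sub in_invlist {
    my ($cp, $list) = @_;
    my ($lo, $hi) = (0, scalar @$list);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($list->[$mid] <= $cp) { $lo = $mid + 1 }
        else                      { $hi = $mid }
    }
    return $lo % 2;
}

for my $char ("0", "9", ":", "A", "G", "f", "g") {
    printf "%s (%d): %s\n", $char, ord $char,
        in_invlist(ord $char, \@invlist) ? "yes" : "no";
}
```

The search assumes the list is sorted in ascending order, which the 
description above implies.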

In the final example above, the first ASCII Hex digit is code point 48, 
the character "0", and all code points from it through 57 (a "9") are 
ASCII hex digits. Code points 58 through 64 aren't, but 65 (an "A") 
through 70 (an "F") are, as are 97 ("a") through 102 ("f"). 103 starts a 
range of code points that aren't ASCII hex digits. That range extends to 
infinity, which on your computer is represented by the value in the 
variable $Unicode::UCD::MAX_CP.
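
The walk-through above can also be done mechanically by pairing up the 
elements.  The following is only a sketch: invlist_to_ranges is a name 
made up for this example (not part of Unicode::UCD), the list is 
hard-coded from the example output, and a final range with no upper 
bound is reported as undef instead of being clamped to any particular 
maximum:

```perl
use strict;
use warnings;

# Inversion list for ASCII_Hex_Digit=Yes, copied from the example output.
my @invlist = (48, 58, 65, 71, 97, 103);

# Hypothetical helper, not part of Unicode::UCD.
# Returns [start, end] pairs of matching code points; end is undef when
# the last matching range has no upper bound.
sub invlist_to_ranges {
    my @list = @_;
    my @ranges;
    for (my $i = 0; $i < @list; $i += 2) {
        my $end = $i + 1 < @list ? $list[$i + 1] - 1 : undef;
        push @ranges, [ $list[$i], $end ];
    }
    return @ranges;
}

for my $r (invlist_to_ranges(@invlist)) {
    my ($start, $end) = @$r;
    printf "%d..%s\n", $start, defined $end ? $end : "(no upper bound)";
}
```

Keeping the ranges as pairs avoids building a huge array for properties 
whose last range is open-ended.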

It is a simple matter to expand out an inversion list to a full list of 
all code points that have the property-value:

  use Unicode::UCD qw(prop_invlist);
  my @invlist = prop_invlist("My Property");
  die "empty" unless @invlist;
  my @full_list;
  for (my $i = 0; $i < @invlist; $i += 2) {
     my $upper = ($i + 1) < @invlist
                 ? $invlist[$i+1] - 1      # In range
                 : $Unicode::UCD::MAX_CP;  # To infinity.  You may want
                                           # to stop much much earlier;
                                           # going this high may expose
                                           # perl bugs with very large
                                           # numbers.
     for my $j ($invlist[$i] .. $upper) {
         push @full_list, $j;
     }
  }


