RFC: API to access Unicode db files

Karl Williamson
July 21, 2011 08:04
Some applications are finding it necessary to read in the Unicode files 
that mktables generates.  For example, grepping through CPAN indicates 
that Text::Unicode::Equivalents reads one of them.  This, and most 
of the other generated files are marked for internal use only, because 
we wish to reserve the right to change them around, etc.  But 
applications currently have no feasible alternative.  Prior to 5.14, we 
delivered the full Unicode db files that the Unicode Consortium 
publishes, and whose format is guaranteed not to change.  But we dropped 
those files in 5.14 to save disk space.

I'm proposing a new function Unicode::UCD::prop_invmap() to return the 
contents of those files in a Unicode-centric way, so that applications 
can use it and we can deprecate non-core use of our generated files.

The function returns an inversion map, which is a data structure more 
used in the Unicode world than the Perl world.  It consists of two 
parallel arrays.  I suppose a more Perl-centric data structure would be 
an array of hashes, but the inversion map seems simpler to me to manipulate.

(This function would be in addition to the previously rfc'd function 
Unicode::UCD::prop_invlist() which would return a list of all code 
points that match a property-value.)


=head2 prop_invmap

C<prop_invmap> is used to get the complete mapping definition for the input
property, in the form of an inversion map.  An inversion map consists of two
parallel arrays.  One is an ordered list of code points that mark range
beginnings, and the other gives the value that all code points in the
corresponding range have.  C<prop_invmap> is called with the name of the
desired property, and references to the two arrays, which it fills.  For
example,

  prop_invmap("Numeric_Value", \@numerics_ranges, \@numerics_maps);

will populate the arrays as shown below:

  @numerics_ranges  @numerics_maps        Note
         0x00             "NaN"          NaN stands for "Not a Number"
         0x30             0              DIGIT 0
         0x31             1
         0x32             2
         ...              ...
         0x37             7
         0x38             8
         0x39             9              DIGIT 9
         0x3A             "NaN"
         0xB2             2              SUPERSCRIPT 2
         0xB3             3              SUPERSCRIPT 3
         0xB4             "NaN"
         0xB9             1              SUPERSCRIPT 1
         0xBA             "NaN"
         0xBC             0.25           VULGAR FRACTION 1/4
         0xBD             0.5            VULGAR FRACTION 1/2
         0xBE             0.75           VULGAR FRACTION 3/4
         0xBF             "NaN"
         0x660            0              ARABIC-INDIC DIGIT ZERO
         ...              ...
      0x110000            undef

The second line means that the value for the code point 0x30 (which is
"0") is 0.  The first line means that all code points in the range from
0x00 to 0x2F (which is 0x30 (from the second line) - 1) have the value
"NaN".  The final line means that all code points above the legal
Unicode maximum code point have the value C<undef> (not the string
"undef").

The arrays completely specify the mappings for all possible code points.
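
To make the range semantics concrete, here is a minimal sketch that expands an inversion map into explicit ranges.  The arrays are hand-copied from the first part of the Numeric_Value example rather than produced by calling C<prop_invmap>, the digits 0x33 to 0x36 are filled in on the assumption that they follow the same pattern, and the helper name C<invmap_to_ranges> is hypothetical.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Numeric_Value example (assumption: not the
# output of a real prop_invmap() call; the excerpt stops at 0xBA).
my @numerics_ranges = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA);
my @numerics_maps   = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN", 1, "NaN");

# Hypothetical helper: turn the two parallel arrays into a list of
# [ start, end, value ] triples.  Each range ends one code point before
# the next range begins; the final range is open-ended (end is undef).
sub invmap_to_ranges {
    my ($ranges, $maps) = @_;
    my @out;
    for my $i (0 .. $#$ranges) {
        my $end = $i < $#$ranges ? $ranges->[$i + 1] - 1 : undef;
        push @out, [ $ranges->[$i], $end, $maps->[$i] ];
    }
    return @out;
}

for my $r (invmap_to_ranges(\@numerics_ranges, \@numerics_maps)) {
    my ($start, $end, $value) = @$r;
    printf "%04X..%s => %s\n",
        $start, (defined $end ? sprintf("%04X", $end) : "(open)"), $value;
}
```

Because every range is closed by the start of the next one, no code point is left without a value, which is what "completely specify" means here.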

The special string S<C<"E<lt>code pointE<gt>">> is used to specify that
the value of a code point is itself.  For example, the beginnings of the
arrays for

  prop_invmap("Uppercase_Mapping", \@uppers_ranges, \@uppers_maps);

look like this:

  @uppers_ranges    @uppers_maps       Note
        0          "<code point>"
       97              65          'a' maps to 'A'
       98              66          'b' => 'B'
       99              67          'c' => 'C'
       ...             ...
      120              88          'x' => 'X'
      121              89          'y' => 'Y'
      122              90          'z' => 'Z'
      123         "<code point>"
      181             924          MICRO SIGN => Greek Cap MU
      182         "<code point>"
      223           [ 83 83 ]      SHARP S => 'SS'
      224             192

The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...
of 96 is 96.  Without the C<"E<lt>code pointE<gt>"> notation, every code
point would have to have an entry.  This would mean that the arrays would
each have more than a million entries to list just the legal Unicode code
points!

In some properties some code points map to a sequence of multiple code
points.  For those, the corresponding entries in the map array are not
scalars, but
references to anonymous arrays containing the ordered list of code points
mapped to, as shown in the example above for 223.
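
A caller has to handle three kinds of map entry: the C<"E<lt>code pointE<gt>"> marker, a plain scalar, and an array reference.  The following is a rough sketch of that dispatch, using arrays hand-copied from the Uppercase_Mapping example (with 'd' through 'w' filled in on the assumption that they follow the same pattern).  The helper names C<resolve_entry> and C<upper_cps> are hypothetical, and the linear scan is only there to keep the example short.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Uppercase_Mapping example (assumption: not
# the output of a real prop_invmap() call; nothing beyond 224 is known).
my @uppers_ranges = (0, 97 .. 122, 123, 181, 182, 223, 224);
my @uppers_maps   = ("<code point>", 65 .. 90, "<code point>", 924,
                     "<code point>", [ 83, 83 ], 192);

# Hypothetical helper: turn one map entry into a list of code points.
sub resolve_entry {
    my ($entry, $cp) = @_;
    return @$entry if ref $entry eq 'ARRAY';       # multi-code-point mapping
    return ($cp)   if $entry eq "<code point>";    # maps to itself
    return ($entry);                               # simple scalar mapping
}

# Hypothetical helper: find the range containing $cp by linear scan and
# return the code points it maps to.
sub upper_cps {
    my ($cp) = @_;
    my $i = 0;
    $i++ while $i < $#uppers_ranges && $uppers_ranges[$i + 1] <= $cp;
    return resolve_entry($uppers_maps[$i], $cp);
}

for my $cp (65, 97, 181, 223) {
    printf "%d => %s\n", $cp, join(' ', upper_cps($cp));
}
```

Returning a list in every case means the caller does not need to check which kind of entry it got.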

The "Name" property map includes entries such as

  "CJK UNIFIED IDEOGRAPH-<code point>"

This means that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
with the code point (expressed in hexadecimal) appended to it.  Also, the
notation "E<lt>hangul syllableE<gt>" occurs in this property, meaning that
the name is algorithmically calculated.  These names can be generated via
the function C<charnames::viacode()>.
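
As an illustration of the hexadecimal-suffix rule, the sketch below builds such a name from a map entry.  It assumes the entry carries the same C<"E<lt>code pointE<gt>"> marker used elsewhere in this document as a placeholder for the suffix; that entry format and the helper name C<expand_name> are assumptions, not part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the "<code point>" placeholder in a Name
# map entry with the code point in uppercase hexadecimal, padded to at
# least four digits.
sub expand_name {
    my ($entry, $cp) = @_;
    (my $name = $entry) =~ s/<code point>/sprintf("%04X", $cp)/e;
    return $name;
}

# Assumed entry format; see the note above.
print expand_name("CJK UNIFIED IDEOGRAPH-<code point>", 0x4E00), "\n";
```

Entries without the placeholder pass through unchanged, so the same helper can be applied to every scalar entry in the map.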

The "Decomposition_Mapping" property also uses
"E<lt>hangul syllableE<gt>" for those code points whose decomposition is
algorithmically calculated.  These can be generated via the function
C<Unicode::Normalize::NFD()>.  The map for this property contains many
entries whose mappings are ordered lists of other code points.

The return value is
C<undef> if the property is unknown;
C<s> if all the elements of the map array are simple scalars;
C<n> for the Name property, which has the complications described above;
C<d> for the Decomposition_Mapping property (complications already
described above);
otherwise C<c> if some of the map array elements are
S<C<"E<lt>code pointE<gt>">>;
and C<l> if additionally some are lists of code points.

A binary search can be used to quickly find a code point in the inversion
list, and hence its corresponding mapping.
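
One possible shape for that search is sketched below.  It looks for the last range start that is less than or equal to the code point and returns the value at the same index in the map array.  The data is hand-copied from the Numeric_Value example (digits 0x33 to 0x36 filled in by assumption) and the helper names C<invmap_index> and C<invmap_lookup> are hypothetical.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Numeric_Value example (assumption: not the
# output of a real prop_invmap() call; lookups above 0xBF would need the
# rest of the map).
my @ranges = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA,
              0xBC, 0xBD, 0xBE, 0xBF);
my @maps   = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN", 1, "NaN",
              0.25, 0.5, 0.75, "NaN");

# Hypothetical helper: binary search for the largest index i such that
# $ranges->[i] <= $cp.  Returns -1 if $cp is below the first range start.
sub invmap_index {
    my ($ranges, $cp) = @_;
    my ($lo, $hi, $found) = (0, $#$ranges, -1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($ranges->[$mid] <= $cp) {
            $found = $mid;       # candidate; a later range may also qualify
            $lo    = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return $found;
}

# Hypothetical helper: value for $cp, or undef if no range contains it.
sub invmap_lookup {
    my ($ranges, $maps, $cp) = @_;
    my $i = invmap_index($ranges, $cp);
    return $i < 0 ? undef : $maps->[$i];
}

printf "U+%04X => %s\n", $_, invmap_lookup(\@ranges, \@maps, $_)
    for 0x35, 0x41, 0xB3, 0xBD;
```

The search takes a number of steps logarithmic in the number of ranges, which matters because a full map can have thousands of entries.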

