New API available to access Unicode DB, and RFC on changes to it.

From: Karl Williamson
Date: November 21, 2011 12:42
Subject: New API available to access Unicode DB, and RFC on changes to it.
Message ID: 4ECAB7AE.6040609@khwilliamson.com

Perl 5.15.5, now available, has additions to Unicode::UCD that allow 
unfettered programmatic access to the Unicode Character Database.  The 
API is quite similar to what was sent out for comment on this list 
several months ago; several changes were required as a result of lessons 
learned during implementation.  This email has an attachment, an HTML 
file that shows (with a yellow background) the additions to the pod 
since 5.14.

As a result of this API, reading the files in lib/unicore directly is 
now deprecated.  Those files may change, whereas the API will be stable 
as of 5.16.  In the meantime, I'd be happy to have people use this and 
give me feedback on any problems with the API or bugs in the code.

And, I do wish to change the API already for certain of the outputs in 
prop_invmap() in order to make them more compact.  For example, take the 
uc() property.  What it currently returns is this (taken from the 
attached pod):

  @$uppers_ranges_ref    @$uppers_maps_ref   Note
        0                 "<code point>"
       97                     65          'a' maps to 'A'
       98                     66          'b' => 'B'
       99                     67          'c' => 'C'
       ...
      120                     88          'x' => 'X'
      121                     89          'y' => 'Y'
      122                     90          'z' => 'Z'
      123                "<code point>"
      181                    924          MICRO SIGN => Greek Cap MU
      182                "<code point>"
      ...
     0x0149              [ 0x02BC 0x004E ]
     0x014A              "<code point>"
     0x014B                 0x014A
      ...


That could be more compactly represented as:

  @$uppers_ranges_ref    @$uppers_maps_ref   Note
        0                      0
       97                    -32          'a'-'z' map to 'A'-'Z'
      123                      0
      181                    743          MICRO SIGN => Greek Cap MU
      182                      0
      ...
     0x0149              [ 0x02BC 0x004E ]
     0x014A                    0
     0x014B                   -1
      ...

where the map is to be added to the code point to get the final result. 
Thus only one entry is needed to represent all 26 ASCII lowercase 
character mappings, instead of 26 entries.  This makes such tables 
significantly smaller.  The Perl core currently does a linear search 
through them looking for mappings.  Using the more compact versions 
would speed that up significantly.  The gain is 30-40%, and for the 
decimal digit mapping the table becomes a full order of magnitude 
smaller, making the search much faster.
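
To make the "add the map to the code point" rule concrete, here is a 
minimal sketch of a lookup against the compact form.  It hard-codes only 
the first five rows of the proposed table above, so it says nothing about 
code points past 182.  The names range_index() and delta_upper() are 
made up for this illustration; this is not the core's code, just a linear 
scan of the same general shape.

```perl
use strict;
use warnings;

# First rows of the proposed compact table from this message.  The real
# table continues past 182; this excerpt is for illustration only.
my @uppers_ranges = (0, 97, 123, 181, 182);
my @uppers_maps   = (0, -32,  0, 743,   0);

# Hypothetical helper: linear scan for the index of the last range whose
# start is <= $cp.
sub range_index {
    my ($ranges, $cp) = @_;
    my $i = 0;
    $i++ while $i < $#{$ranges} && $ranges->[$i + 1] <= $cp;
    return $i;
}

# Hypothetical helper: apply the delta stored for the range containing $cp.
sub delta_upper {
    my ($cp) = @_;
    return $cp + $uppers_maps[ range_index(\@uppers_ranges, $cp) ];
}

printf "%d -> %d\n", $_, delta_upper($_) for ord('a'), ord('z'), 181;
```

With the delta form the scan passes at most five entries to resolve any 
ASCII letter, where the expanded form needs one entry per letter.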

Returning the delta only makes sense for a few tables: those whose 
maps are code points, or the decimal digits.

As you can see in the example for 0x0149, I wouldn't propose to make 
deltas of the lists, even though that is inconsistent.  They generally 
require special handling.
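
For anyone who wants to see the two forms side by side, below is a rough 
sketch of converting the current output into the proposed one.  It 
assumes every scalar map entry in the input covers a single code point 
(true of the rows shown above), treats "<code point>" as a delta of 0, 
and passes list entries through unchanged.  to_delta_form() is a 
hypothetical name, and the input arrays are just the excerpts from the 
tables in this message.

```perl
use strict;
use warnings;

# Excerpt 1: rows 0 through 182 of the current output shown above.
my @cur_ranges = (0, 97 .. 122, 123, 181, 182);
my @cur_maps   = ('<code point>', 65 .. 90, '<code point>', 924, '<code point>');

# Excerpt 2: the three rows starting at 0x0149, including the list entry.
my @list_ranges = (0x0149, 0x014A, 0x014B);
my @list_maps   = ([0x02BC, 0x004E], '<code point>', 0x014A);

# Hypothetical converter: current form in, delta form out.
sub to_delta_form {
    my ($ranges, $maps) = @_;
    my (@out_ranges, @out_maps);
    for my $i (0 .. $#{$ranges}) {
        my $map = $maps->[$i];
        my $new = ref $map               ? $map                    # list: left as is
                : $map eq '<code point>' ? 0                       # maps to itself
                :                          $map - $ranges->[$i];   # delta
        # Adjacent ranges with the same scalar delta collapse into one.
        next if @out_maps
             && !ref $new
             && !ref $out_maps[-1]
             && $out_maps[-1] == $new;
        push @out_ranges, $ranges->[$i];
        push @out_maps,   $new;
    }
    return (\@out_ranges, \@out_maps);
}

my ($r, $m) = to_delta_form(\@cur_ranges, \@cur_maps);
printf "%d entries -> %d entries\n", scalar @cur_ranges, scalar @$r;
```

On the first excerpt this should reproduce the compact rows given above; 
on the second, the list entry survives untouched while its neighbours 
become 0 and -1.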
