
Re: RFC: API to access Unicode db files

From: Karl Williamson
Date: August 17, 2011 13:42
Subject: Re: RFC: API to access Unicode db files
Message ID: 4E4C278C.8080905@khwilliamson.com

Here's a new version of the API for comment, with the addition of 2 
extra functions:



    prop_invlist()
        "prop_invlist" returns an inversion list (described below)
        that defines all the code points for the Unicode property
        given by the input parameter string:

         use Unicode::UCD 'prop_invlist';
         say join ", ", prop_invlist("Any");

         0, 1114112

        An empty list is returned if the given property is unknown;
        the number of elements in the list is returned if called in
        scalar context.

        perluniprops gives the list of properties that this function
        accepts, as well as all the possible forms for them (loose
        matching rules are used on the parameter).  Note that many
        properties can be specified in a compound form, such as

         say join ", ", prop_invlist("Script=Shavian");
         66640, 66688

         say join ", ", prop_invlist("ASCII_Hex_Digit=No");
         0, 48, 58, 65, 71, 97, 103

         say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
         48, 58, 65, 71, 97, 103

        Inversion lists are a compact way of specifying Unicode
        properties.  The 0th item in the list is the lowest code
        point that has the property-value.  The next item is the
        lowest code point after that one that does NOT have the
        property-value.  And the next item after that is the lowest
        code point after that one that has the property-value, and so
        on.  Put another way, each element in the list gives the
        beginning of a range that has the property-value (for even
        numbered elements), or doesn't have the property-value (for
        odd numbered elements).

        In the final example above, the first ASCII Hex digit is code
        point 48, the character "0", and all code points from it
        through 57 (a "9") are ASCII hex digits.  Code points 58
        through 64 aren't, but 65 (an "A") through 70 (an "F") are,
        as are 97 ("a") through 102 ("f").  103 starts a range of
        code points that aren't ASCII hex digits.  That range extends
        to infinity, which on your computer can be found in the
        variable $Unicode::UCD::MAX_CP.  (This variable is as close
        to infinity as Perl can get on your platform, and may be too
        high for some operations to work; you may wish to use a
        smaller number for your purposes.)

        The name for this data structure stems from the fact that
        each element in the list toggles (or inverts) whether the
        corresponding range is or isn't on the list.
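
        Because the elements are sorted, membership can be tested
        without expanding the list.  The following is a minimal
        sketch, not part of the proposed API: "in_invlist" is a
        hypothetical helper, and the list is hard-coded from the
        ASCII_Hex_Digit=Yes example above so that the code runs
        without Unicode::UCD.

```perl
use strict;
use warnings;

# Inversion list for ASCII_Hex_Digit=Yes, copied from the example above.
my @invlist = (48, 58, 65, 71, 97, 103);

# Return true if $cp falls in a range that has the property-value.
# in_invlist() is a hypothetical helper, not part of Unicode::UCD.
sub in_invlist {
    my ($cp, $invlist_ref) = @_;
    my ($lo, $hi) = (0, scalar @$invlist_ref);   # search window [lo, hi)

    # Binary search for the number of elements that are <= $cp.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist_ref->[$mid] <= $cp) { $lo = $mid + 1 }
        else                             { $hi = $mid }
    }

    # $lo elements are <= $cp, so $cp lies in the range begun by element
    # $lo - 1.  Even-numbered elements begin ranges that have the
    # property-value, so an odd count means a match.
    return $lo % 2 == 1;
}

printf "%s: %s\n", $_, in_invlist(ord($_), \@invlist) ? "yes" : "no"
    for qw(0 9 A F G a f g);
```

        A code point below the first element yields a count of zero,
        which is even, so it is correctly reported as not having the
        property-value.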

        It is a simple matter to expand out an inversion list to a
        full list of all code points that have the property-value:

         my @invlist = prop_invlist("My Property");
         die "empty" unless @invlist;
         my @full_list;
         for (my $i = 0; $i < @invlist; $i += 2) {
            my $upper = ($i + 1) < @invlist
                        ? $invlist[$i+1] - 1      # In range
                        : $Unicode::UCD::MAX_CP;  # To infinity.  You may
                                                  # want to stop much, much
                                                  # earlier; going this high
                                                  # may expose perl bugs
                                                  # with very large numbers.
            for my $j ($invlist[$i] .. $upper) {
                push @full_list, $j;
            }
         }
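
        When a full expansion would be too large, the same loop can
        produce first/last pairs instead.  This is only a sketch
        under stated assumptions: "invlist_to_ranges" is a
        hypothetical helper, the caller picks the upper bound used to
        close an open-ended final range, and the input is hard-coded
        from the ASCII_Hex_Digit=No example.

```perl
use strict;
use warnings;

# Convert an inversion list into [first, last] pairs.  $max is the code
# point used to close a final open-ended range.  invlist_to_ranges() is a
# hypothetical helper, not part of Unicode::UCD.
sub invlist_to_ranges {
    my ($invlist_ref, $max) = @_;
    my @ranges;
    for (my $i = 0; $i < @$invlist_ref; $i += 2) {
        my $last = $i + 1 < @$invlist_ref
                 ? $invlist_ref->[$i + 1] - 1   # closed by the next element
                 : $max;                        # runs to "infinity"
        push @ranges, [ $invlist_ref->[$i], $last ];
    }
    return @ranges;
}

# ASCII_Hex_Digit=No from the examples above; its last range is open-ended.
my @ranges = invlist_to_ranges([0, 48, 58, 65, 71, 97, 103], 0x10FFFF);
printf "%d..%d\n", @$_ for @ranges;
```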

    prop_aliases()
            use Unicode::UCD 'prop_aliases';

            my $full_name = prop_aliases("White Space");
            my @all_names = prop_aliases("White Space");
            my $short_name = $all_names[0];
            print join(", ", @all_names), "\n";

            XXX

        Most Unicode properties have several synonymous names.
        Typically, there is at least a short name, convenient to
        type, and a long name that more fully describes the property,
        and hence is more easily understood.

        If you know one name for a property, you can use
        "prop_aliases" to find either the long name (when called in
        scalar context), or a list of all of the names, somewhat
        ordered so that the short name is in the 0th element, the
        long name in the next element, and any other synonyms in the
        remaining elements, in no particular order.

        The long name is returned in a form nicely capitalized,
        suitable for printing.

        White space, hyphens, and underscores are ignored in the
        input parameter name.

        If the name is unknown, "undef" is returned.
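
        The effect of ignoring those characters can be pictured as
        reducing each name to a comparison key.  The sketch below is
        a simplification, not the function's actual matching code:
        "loose_key" is a hypothetical helper, and folding case is an
        assumption carried over from the loose matching described for
        "prop_invmap".

```perl
use strict;
use warnings;

# Reduce a property name to a key for comparison: fold case, then drop
# white space, hyphens, and underscores.  loose_key() is a hypothetical
# helper; the case folding is an assumption.
sub loose_key {
    my ($name) = @_;
    (my $key = lc $name) =~ s/[\s_-]+//g;
    return $key;
}

print loose_key($_), "\n" for "White Space", "white_space", "White-Space";
```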

    prop_value_aliases()
            use Unicode::UCD 'prop_value_aliases';

            my $full_name = prop_value_aliases("Gc", "Punct");
            my @all_names = prop_value_aliases("Gc", "Punct");
            my $short_name = $all_names[0];
            print "The aliases are: ", join(", ", @all_names), "\n";
            print "The fullname is $full_name\n";

            The aliases are: P, Punctuation, Punct
            The fullname is Punctuation

        Some Unicode properties have a restricted set of legal
        values.  For example, all binary properties are restricted to
        just "true" or "false"; and there are only a few dozen
        possible General Categories.

        For such properties, there are usually several synonyms for
        each possible value.  For example, in binary properties,
        truth can be represented by any of the strings, "Y", "Yes",
        "T", or "True"; and the General Category "Punctuation" by
        that string, or "Punct", or simply "P".

        Like property names, there is typically at least a short name
        for each such property-value, and a long name.  If you know
        any name of the property-value, you can use
        "prop_value_aliases()" to get the long name (when called in
        scalar context), or a list of all the names, with the short
        name in the 0th element, the long name in the next element,
        and any other synonyms in the remaining elements, in no
        particular order, except that any all-numeric synonyms will
        be last.

        The long name is returned in a form nicely capitalized,
        suitable for printing.

        White space, hyphens, and underscores are ignored in the
        input parameters.

        If either name is unknown, "undef" is returned.

        If called with a property that doesn't have synonyms for its
        values, it returns the input value, possibly normalized with
        capitalization and underscores.

        For the block property, new-style block names are returned
        (see "Old-style versus new-style block names").

    prop_invmap()
        "prop_invmap" is used to get the complete mapping definition
        for a property, in the form of an inversion map.  An
        inversion map consists of two parallel arrays.  One is an
        ordered list of code points that mark range beginnings, and
        the other gives the value (or mapping) that all code points
        in the corresponding range have.

        "prop_invmap" is called with the name of the desired
        property.  The name is loosely matched, meaning that
        differences in case, white-space, hyphens, and underscores
        are not meaningful.  Many Unicode properties have more than
        one name (or alias).  "prop_invmap" understands all of these.
        "undef" is returned if the property name is unknown.

        It is a fatal error to call this function except in list
        context.

        In addition to the two arrays that form the inversion map,
        "prop_invmap" returns two other values: one is a scalar that
        gives some details as to the format of the entries of the map
        array; the other is used for specialized purposes, described
        at the end of this section.

        This means that "prop_invmap" returns a 4 element list.  For
        example,

         my ($blocks_ranges_ref, $blocks_maps_ref, $format, $default)
                                              = prop_invmap("Block");

        In this call, the two arrays will be populated as shown below
        (for Unicode 6.0):

         Index  @blocks_ranges  @blocks_maps
           0        0x0000      Basic Latin
           1        0x0080      Latin-1 Supplement
           2        0x0100      Latin Extended-A
           3        0x0180      Latin Extended-B
           4        0x0250      IPA Extensions
           5        0x02B0      Spacing Modifier Letters
           6        0x0300      Combining Diacritical Marks
           7        0x0370      Greek and Coptic
           8        0x0400      Cyrillic
          ...
         233        0x2B820     No_Block
         234        0x2F800     CJK Compatibility Ideographs Supplement
         235        0x2FA20     No_Block
         236        0xE0000     Tags
         237        0xE0080     No_Block
         238        0xE0100     Variation Selectors Supplement
         239        0xE01F0     No_Block
         240        0xF0000     Supplementary Private Use Area-A
         241        0x100000    Supplementary Private Use Area-B
         242        0x110000    No_Block

        The first line (with Index 0) means that the value for code
        point 0 is "Basic Latin".  The entry "0x0080" in the
        @blocks_ranges column in the second line means that the value
        from the first line, "Basic Latin", extends to all code
        points in the range up to but not including 0x0080, that is,
        to 127.  In other words, the code points from 0 to 127 are
        all in the "Basic Latin" block.  Similarly, all code points
        in the range from 0x0080 up to (but not including) 0x0100 are
        in the block named "Latin-1 Supplement", etc.  (Notice that
        the names returned are the old-style block names; see
        "Old-style versus new-style block names").

        The final line (with Index 242) means that all code points
        above the legal Unicode maximum code point have the value
        "No_Block", which is the term Unicode uses for a
        non-existing block.

        The arrays completely specify the mappings for all possible
        code points.  The final element in an inversion map returned
        by this function will always be for the range that consists
        of all the code points that aren't legal Unicode, but that
        are expressible on the platform.  (That is, it starts with
        code point 0x110000, the first code point above the legal
        Unicode maximum, and extends to infinity.)  The value for that
        range will be the same one that any normal unassigned code
        point has for the specified property.  (Certain unassigned code
        points are not "normal"; for example the non-character code
        points, or those in blocks that are to be written right-to-
        left.  The range value will not necessarily be the same as
        those code points have.)  It could be argued that, instead of
        treating these as unassigned Unicode code points, the value
        for this range should be "undef".  You can make that decision
        and change the returned array accordingly.
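
        Making that change takes one assignment.  The following is a
        minimal sketch that uses the tail of the "Block" table above
        as hard-coded data rather than calling "prop_invmap"; it
        assumes, as described, that the final range starts at
        0x110000.

```perl
use strict;
use warnings;

# The tail of the Block inversion map shown above (Unicode 6.0).
my @blocks_ranges = (0xF0000, 0x100000, 0x110000);
my @blocks_maps   = ("Supplementary Private Use Area-A",
                     "Supplementary Private Use Area-B",
                     "No_Block");

# Treat everything above the Unicode maximum as having no value at all.
$blocks_maps[-1] = undef
    if @blocks_ranges && $blocks_ranges[-1] == 0x110000;

print defined($blocks_maps[-1]) ? $blocks_maps[-1] : "undef", "\n";
```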

        The maps are almost always simple scalars that should be
        interpreted as-is.  These values are those given in the
        Unicode data files, which may be inconsistent as to
        capitalization and which synonym for a property-value is
        given.  The results may be normalized by using the
        "prop_value_aliases()" function.

        There are exceptions to the simple scalar maps.  Some
        properties have some elements in their map list that are
        themselves lists of scalars; and some special strings are
        returned that are not to be interpreted as-is.  Element [2]
        (placed into $format in the example above) of the returned 4
        element list tells you if the map has any of these special
        elements, as follows:

        "s" means all the elements of the map array are simple
            scalars.  Almost all properties are like this, like the
            "block" example above.

        "sl"
            means that some of the map array elements have the form
            given by "s", and the rest are lists of scalars.  For
            example, here is a portion of the output of calling
            "prop_invmap()" with the "Script Extensions" property:

             @scripts_ranges  @scripts_maps
                  ...
                  0x0953      Deva
                  0x0964      [ Beng Deva Guru Orya ]
                  0x0966      Deva
                  0x0970      Common

            Here, the code points 0x964 and 0x965 are used in the
            Bengali, Devanagari, Gurmukhi, and Oriya scripts.

        "r" means that all the elements of the map array are either
            rational numbers or the string "NaN", meaning "Not a
            Number".  A rational number is either an integer, or two
            integers separated by a solidus ("/").  The second
            integer represents the denominator of the division
            implied by the solidus, and is guaranteed not to be 0.
            If you want to convert them to scalar numbers, you can
            use something like this:

             my ($invlist_ref, $invmap_ref, $format)
                                     = prop_invmap($property);
             if ($format && $format eq "r") {
                 # Leave "NaN" alone; evaluate "7" or "1/4" numerically
                 for (@$invmap_ref) {
                     $_ = eval $_ unless $_ eq "NaN";
                 }
             }

            Here are some entries from the output of the property
            "Nv", which has format "r".

             @numerics_ranges  @numerics_maps        Note
                    0x00             "NaN"
                    0x30             0              DIGIT 0
                    0x31             1
                    0x32             2
                    ...
                    0x37             7
                    0x38             8
                    0x39             9              DIGIT 9
                    0x3A             "NaN"
                    0xB2             2              SUPERSCRIPT 2
                    0xB3             3              SUPERSCRIPT 3
                    0xB4             "NaN"
                    0xB9             1              SUPERSCRIPT 1
                    0xBA             "NaN"
                    0xBC             1/4            VULGAR FRACTION 1/4
                    0xBD             1/2            VULGAR FRACTION 1/2
                    0xBE             3/4            VULGAR FRACTION 3/4
                    0xBF             "NaN"
                    0x660            0          ARABIC-INDIC DIGIT ZERO

        "c" is like "s" in that all the map array elements are
            scalars, but some of them are the special string
            "<code point>", meaning that the map of each code point
            in the corresponding range in the inversion list is the
            code point itself.  For example, in:

             my ($uppers_ranges_ref, $uppers_maps_ref, $format)
                       = prop_invmap("Simple_Uppercase_Mapping");

            the returned arrays look like this:

             @$uppers_ranges_ref    @$uppers_maps_ref   Note
                   0                 "<code point>"
                  97                     65          'a' maps to 'A'
                  98                     66          'b' => 'B'
                  99                     67          'c' => 'C'
                  ...
                 120                     88          'x' => 'X'
                 121                     89          'y' => 'Y'
                 122                     90          'z' => 'Z'
                 123                "<code point>"
                 181                    924          MICRO SIGN =>
                                                     Greek Cap MU
                 182                "<code point>"
                 ...

            The first line means that the uppercase of code point 0
            is 0; the uppercase of code point 1 is 1; ...  of code
            point 96 is 96.  Without the "<code point>" notation,
            every code point would have to have an entry.  This would
            mean that the arrays would each have more than a million
            entries to list just the legal Unicode code points!

        "cl"
            means that some of the map array elements have the form
            given by "c", and the rest are ordered lists of code
            points.  For example, in:

             my ($uppers_ranges_ref, $uppers_maps_ref, $format)
                                = prop_invmap("Uppercase_Mapping");

            the returned arrays look like this:

             @$uppers_ranges_ref    @$uppers_maps_ref       Note
                   0                 "<code point>"
                  97                     65
                 ...
                 122                     90
                 123                "<code point>"
                 181                    924
                 182                "<code point>"
                 ...
                0x0149              [ 0x02BC 0x004E ]

            This is the full Uppercase_Mapping property (as opposed
            to the Simple_Uppercase_Mapping given in the example for
            "c").  The only difference between the two in the ranges
            shown is that the code point at 0x0149 (LATIN SMALL
            LETTER N PRECEDED BY APOSTROPHE) maps to a string of two
            characters, 0x02BC (MODIFIER LETTER APOSTROPHE) followed
            by 0x004E (LATIN CAPITAL LETTER N).

        "n" means the Name property.  All the elements of the map
            array are simple scalars, but some of them contain
            special strings that require more work to get the actual
            name.

            Entries such as:

             CJK UNIFIED IDEOGRAPH-<code point>

            mean that the name for the code point is "CJK UNIFIED
            IDEOGRAPH-" with the code point (expressed in
            hexadecimal) appended to it (similarly for "CJK
            COMPATIBILITY IDEOGRAPH-<code point>").

            Also, entries like

             <hangul syllable>

            mean that the name is algorithmically calculated.  This
            is easily done by the function charnames::viacode().

            Note that for control characters ("Gc=cc"), Unicode's
            data files have the string "<control>", but the real name
            of each of these characters is the empty string.  This
            function returns the real name.

        "d" means the Decomposition_Mapping property.  Like "n", this
            property uses

             <hangul syllable>

            for those code points whose decomposition is
            algorithmically calculated.  These can be generated via
            the function Unicode::Normalize::NFD().

            Otherwise, this property is like "cl" properties.

            Note that the mapping is the one that is specified in the
            Unicode data files, and to get the final decomposition,
            it may need to be applied recursively.

        A binary search can be used to quickly find a code point in
        the inversion list, and hence its corresponding mapping.
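
        Such a search might look like the following.  This is a
        sketch under stated assumptions: "invmap_lookup" is a
        hypothetical helper, and the data is the
        Simple_Uppercase_Mapping excerpt from the "c" example,
        hard-coded and truncated after 182, so results above that
        point are not meaningful.

```perl
use strict;
use warnings;

# Excerpt of the Simple_Uppercase_Mapping inversion map shown above.
# Truncated: in this excerpt everything from 182 up maps to itself.
my @ranges = (0, 97 .. 122, 123, 181, 182);
my @maps   = ("<code point>", 65 .. 90, "<code point>", 924, "<code point>");

# Return the mapping for $cp.  invmap_lookup() is a hypothetical helper.
sub invmap_lookup {
    my ($cp, $ranges_ref, $maps_ref) = @_;
    my ($lo, $hi) = (0, $#$ranges_ref);

    # Find the last range whose start is <= $cp.  Element 0 is always 0,
    # so such a range exists for any non-negative $cp.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($ranges_ref->[$mid] <= $cp) { $lo = $mid }
        else                            { $hi = $mid - 1 }
    }

    my $map = $maps_ref->[$lo];
    # "c" and "cl" formats: the code point maps to itself
    return $cp if !ref($map) && $map eq "<code point>";
    return $map;    # a scalar, or an array ref for the list formats
}

printf "U+%04X => U+%04X\n", $_, invmap_lookup($_, \@ranges, \@maps)
    for ord("A"), ord("y"), 0xB5;
```

        Callers working with the "sl" or "cl" formats would still
        need to check whether the returned value is an array
        reference before using it.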

        The final element ([3], assigned to $default in the "block"
        example) in the list returned by this function may be useful
        for applications that wish to convert the returned inversion
        map data structure into some other, such as a hash.  It gives
        the mapping that most code points map to under the property.
        If you establish the convention that any code point not
        explicitly listed in your data structure maps to this value,
        you can potentially make your data structure much smaller.
        As you construct your data structure from the one returned by
        this function, simply ignore those ranges that map to this
        value, generally called the "default" value.
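
        As an illustration of that convention, the sketch below
        builds a hash from the "Nv" excerpt shown under format "r".
        The data is hard-coded and truncated at 0xBF, the default of
        "NaN" is an assumption about what element [3] would hold for
        that property, and "numeric_value" is a hypothetical helper.

```perl
use strict;
use warnings;

# Excerpt of the "Nv" inversion map shown earlier, truncated at 0xBF.
my @ranges  = (0x00, 0x30 .. 0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA,
               0xBC, 0xBD, 0xBE, 0xBF);
my @maps    = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN", 1, "NaN",
               "1/4", "1/2", "3/4", "NaN");
my $default = "NaN";    # assumed value of element [3] for this property

my %value_of;
for my $i (0 .. $#ranges) {
    next if $maps[$i] eq $default;      # implied by the convention
    # A non-default final range would be open-ended; this excerpt has none.
    die "open-ended non-default range" if $i == $#ranges;
    $value_of{$_} = $maps[$i] for $ranges[$i] .. $ranges[$i + 1] - 1;
}

# Look up a code point, falling back to the default.
sub numeric_value {
    my ($cp) = @_;
    return exists $value_of{$cp} ? $value_of{$cp} : $default;
}

print numeric_value(ord("7")), " ", numeric_value(0xBD), " ",
      numeric_value(ord("A")), "\n";
```

        The string comparison against the default only suits maps
        made of simple scalars; the list formats would need a "ref"
        check first.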

        One internal Perl property is accessible by this function.
        "Perl_Decimal_Digit" returns an inversion map in which all
        the Unicode decimal digits map to their numeric values, and
        everything else to the empty string, like so:

         @digits    @values
         0x0000       ""
         0x0030       0
         0x0031       1
         0x0032       2
         0x0033       3
         0x0034       4
         0x0035       5
         0x0036       6
         0x0037       7
         0x0038       8
         0x0039       9
         0x003A       ""
         0x0660       0
         0x0661       1
         ...

    Old-style versus new-style block names
        Unicode publishes the names of blocks in two different
        styles, though the two are equivalent under Unicode's loose
        matching rules.

        The original style uses blanks and hyphens in the block names
        (except for "No_Block"), like so:

         Miscellaneous Mathematical Symbols-B

        The newer style replaces these with underscores, like this:

         Miscellaneous_Mathematical_Symbols_B

        This newer style is consistent with the values of other
        Unicode properties.  To preserve backward compatibility, all
        the functions in Unicode::UCD that return block names (except
        one) return the old-style ones.  That one function,
        "prop_value_aliases()", can be used to convert from old-style
        to new-style:

         my $new_style = prop_value_aliases("block", $old_style);
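
        The naming rule itself is simple enough to show as a purely
        textual substitution.  This sketch only illustrates the rule
        stated above; "old_to_new_block_name" is a hypothetical
        helper, and it performs no check that the block exists.

```perl
use strict;
use warnings;

# Purely textual conversion from an old-style block name to the new
# style: blanks and hyphens become underscores.
# old_to_new_block_name() is a hypothetical helper.
sub old_to_new_block_name {
    my ($old_style) = @_;
    (my $new_style = $old_style) =~ tr/ \-/__/;
    return $new_style;
}

print old_to_new_block_name("Miscellaneous Mathematical Symbols-B"), "\n";
```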

