develooper Front page | perl.perl5.porters | Postings from February 2012

RFC & PROPOSAL: add perlunicook.pod to std docset

From:
Tom Christiansen
Date:
February 24, 2012 19:35
Subject:
RFC & PROPOSAL: add perlunicook.pod to std docset
Message ID:
31258.1330140850@chthon
After mining Camel4 some more, I've added recipes for dbm, locales, and
unihan, putting me up to 42 recipes, which seems a good place to stop.  Is
there any reason why this should *not* be included as part of the standard
Perl documentation?  It's something of a FAQ, something of a cheat-sheet,
something of a cookbook.  Here, take two and call me in the morning. 😷

Most of this is supposed to be about making easy stuff easy, although a bit is
admittedly about making hard stuff possible.  But no matter how you look at
it, it's frightening how little of this can be done *at all*, let alone with
this amount of slickness, in any other programming language.  Most of this 
is just impossible anywhere else.

 ℞ 0. Standard preamble
 ℞ 1. Generic Unicode-savvy filter
 ℞ 2. Fine-tuning Unicode warnings
 ℞ 3. Declare source in utf8 for identifiers and literals
 ℞ 4. Characters and their numbers
 ℞ 5. Unicode literals by character number
 ℞ 6. Get character name by number
 ℞ 7. Get character number by name
 ℞ 8. Unicode named characters
 ℞ 9. Unicode named sequences
 ℞10. Custom named characters
 ℞11. Names of CJK codepoints
 ℞12. Unicode casing
 ℞13. Unicode case-insensitive comparisons
 ℞14. Match Unicode linebreak sequence in regex
 ℞15. Get character category
 ℞16. Disabling Unicode-awareness in builtin charclasses
 ℞17. Match Unicode properties in regex with \p, \P
 ℞18. Custom character properties
 ℞19. Convert non-ASCII Unicode numerics
 ℞20. Match Unicode grapheme cluster in regex
 ℞21. Extract by grapheme instead of by codepoint (regex)
 ℞22. Extract by grapheme instead of by codepoint (substr)
 ℞23. Reverse string by grapheme
 ℞24. String length in graphemes
 ℞25. Unicode column-width for printing
 ℞26. Unicode normalization
 ℞27. Unicode collation
 ℞28. Case- *and* accent-insensitive Unicode sort
 ℞29. Unicode locale collation
 ℞30. Making "cmp" work on text instead of codepoints
 ℞31. Case- *and* accent-insensitive comparisons
 ℞32. Case- *and* accent-insensitive locale comparisons
 ℞33. Unicode linebreaking
 ℞34. Decode program arguments as utf8
 ℞35. Decode program arguments as locale encoding
 ℞36. Declare STD{IN,OUT,ERR} to be utf8
 ℞37. Declare STD{IN,OUT,ERR} to locale encoding
 ℞38. Make all I/O default to utf8
 ℞39. Open file with encoding
 ℞40. Explicit encode/decode
 ℞41. Unicode text in DBM hashes, the tedious way
 ℞42. Unicode text in DBM hashes, the easy way

I'm not sure about the order: ℞34–39 seem like they should go a lot earlier.

Yes, it’s unabashedly heavy with Unicode, even brazen. But it has to be, or it
won't work.  The local man system could stand some fine-tuning (we screw
up on UTF-8 manpages, you know!), since I don't think we've been quite so
brazen with this sort of thing before.

    $ uniwc perlunicook.pod 
       Paras    Lines    Words   Graphs    Chars    Bytes File
         200      656     2202    16426    16428    16643 perlunicook.pod

    $ pod2text perlunicook.pod | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
         142      576     2142    17419    17421    17636 standard input

Yes, it's in NFC — that’s as few GCSes as I can squash it down to.

--tom

=encoding utf8

=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

Unless otherwise noted, all examples below assume this standard
preamble, with the C<#!> adjusted to work on your system:

 #!/usr/bin/env perl

 use utf8;
 use v5.12;  # or later
 use strict;  
 use warnings;
 use warnings  qw(FATAL utf8);
 use open      qw(:std :utf8);
 use charnames qw(:full :short);  # unneeded in v5.16

This does make you C<binmode> your binary streams, or open them 
with C<:raw>, but that's the only way to get at them portably anyway.
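
A sketch of what that looks like in practice, with a temp file standing in
for a real binary file (the bytes written are arbitrary illustration):

```perl
use strict;
use warnings;
use open qw(:std :utf8);         # the preamble's utf8 default
use File::Temp qw(tempfile);

# Binary data must opt back out of that default: open with :raw,
# or call binmode on an already-open handle.
my ($out, $path) = tempfile();
binmode($out, ":raw");           # bytes, not characters
print $out "\xFF\xFE\x00";       # arbitrary non-UTF-8 bytes
close($out);

open(my $in, "<:raw", $path) || die "can't open $path: $!";
read($in, my $bytes, 3);         # reads back exactly the 3 bytes written
close($in);
```

The explicit C<:raw> guarantees no encoding layer touches the data either way.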

B<WARNING>: C<use autodie> and C<use open> do not get along with each other.

=head1 EXAMPLES

=head2 Generic Unicode-savvy filter 

Always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);
     ...
 } continue {
     print NFC($_);
 } 

=head2 Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.

 use v5.14;
 no warnings "nonchar";      # the 66 forbidden non-characters
 no warnings "surrogate";    # UTF-16/CESU-8 nonsense
 no warnings "non_unicode";  # for codepoints over 0x10_FFFF

=head2 Declare source in utf8 for identifiers and literals

Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right.  If you use the standard
preamble just given above, this already happens.  If you do, you can
do things like this:

 use utf8;

 my $measure   = "Ångström";
 my @μsoft     = qw( cp852 cp1251 cp1252 );
 my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
 my @鯉        = qw( koi8-f koi8-u koi8-r );
 my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

=head2 Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints.

 # ASCII characters
 ord("A")
 chr(65)

 # characters from the Basic Multilingual Plane
 ord("Σ")
 chr(0x3A3)

 # beyond the BMP
 ord("𝑛") 
 chr(0x1D45B)

 # beyond Unicode! (up to MAXINT)
 ord("\x{20_0000}")
 chr(0x20_0000)

=head2 Unicode literals by character number

In a literal, you may specify a character by its number
using the C<\x{I<HHHHHH>}> escape.

 String: "\x{3a3}"    
 Regex:  /\x{3a3}/

 String: "\x{1d45b}"  
 Regex:  /\x{1d45b}/  

 # even non-BMP ranges in regex work fine
 /[\x{1D434}-\x{1D467}]/ 

=head2 Get character name by number

 use charnames ();
 my $name = charnames::viacode(0x03A3); 

=head2 Get character number by name

 use charnames ();
 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

=head2 Unicode named characters

In v5.16, there is an implicit

 use charnames qw(:full :short);

But prior to that release, you must be explicit about which charnames you
want. You should still specify a script if you want short names that are
script-specific.

 use charnames qw(:full :short greek);

 "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
 "\N{GREEK CAPITAL LETTER SIGMA}"       # :full
 "\N{Greek:Sigma}"                      # :short
 "\N{epsilon}"                          # greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names.

=head2 Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.

 use charnames qw(:full);
 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
 printf "U+%v04X\n", $seq;
 U+0100.0300

=head2 Custom named characters

Give your own nicknames to existing characters, or to unnamed
private-use characters.

 use charnames ":full", ":alias" => {
     ecute => "LATIN SMALL LETTER E WITH ACUTE",
     "APPLE LOGO" => 0xF8FF, # private use character
 };

 "\N{ecute}"
 "\N{APPLE LOGO}"

=head2 Names of CJK codepoints

Sinograms like “東京” come back with character names of 
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>, 
because their “names” vary from language to language.  The CPAN 
C<Unicode::Unihan> module has a large database for decoding these,
provided you know how to understand its output.

    # cpan -i Unicode::Unihan
    use Unicode::Unihan;
    my $str = "東京";
    my $unhan = Unicode::Unihan->new;
    for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
        printf "CJK $str in %-12s is ", $lang;
        say $unhan->$lang($str);
    }

prints:

    CJK 東京 in Mandarin     is DONG1JING1
    CJK 東京 in Cantonese    is dung1ging1
    CJK 東京 in Korean       is TONGKYENG
    CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
    CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind, 
use the specific module:

    # cpan -i Lingua::JA::Romanize::Japanese
    use Lingua::JA::Romanize::Japanese;
    my $k2r = Lingua::JA::Romanize::Japanese->new;
    my $str = "東京";
    say "Japanese for $str is ", $k2r->chars($str);

prints

    Japanese for 東京 is toukyou

=head2 Unicode casing

Unicode casing is very different from ASCII casing.

 uc("henry ⅷ")  # "HENRY Ⅷ"
 uc("tschüß")   # "TSCHÜSS"  notice ß => SS

 # both are true:
 "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

=head2 Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module, 
the new C<fc> “foldcase” function from v5.16 grants 
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

 use feature "fc"; # fc() function is from v5.16

 # sort case-insensitively 
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both are true:
 fc("tschüß")  eq fc("TSCHÜSS")
 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

=head2 Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with intransigent Microsoft systems.

 \R

 s/\R/\n/g;  # normalize all linebreaks to \n
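
A minimal sketch of that normalization idiom, assuming one string with
mixed line endings:

```perl
use v5.10;                       # \R needs perl 5.10 or later

my $text = "dos\r\nunix\nold-mac\r";
$text =~ s/\R/\n/g;              # CRLF and lone CR both become \n
# $text is now "dos\nunix\nold-mac\n"
```

Because C<\R> matches the CRLF pair atomically, a C<\r\n> collapses to a
single C<\n> rather than two.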

=head2 Get character category

Find the general category of a numeric codepoint.

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"

=head2 Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode.

 use v5.14;
 use re "/a";

 # OR

 my($num) = $str =~ /(\d+)/a;

Or just use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{posix_digit}>.  Properties still work normally
no matter what.
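
To see the difference this makes, compare plain C<\d> against C<\d> under
C</a>, here on a string of Devanagari digits:

```perl
use utf8;
use v5.14;                        # the /a regex modifier needs v5.14

my $str = "got ४५६७";             # DEVANAGARI digits four through seven
my ($uni)   = $str =~ /(\d+)/;    # Unicode-aware \d matches them
my ($ascii) = $str =~ /(\d+)/a;   # ASCII-only \d does not
# $uni is "४५६७"; $ascii is undef
```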

=head2 Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property.  Use C<\P> in place of C<\p> to match 
one lacking that property.

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower} 
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}

=head2 Custom character properties

Define at compile-time your own custom character 
properties for use in regexes.  

 # using private-use characters
 sub In_Tengwar { "E000\tE07F\n" } 

 if (/\p{In_Tengwar}/) { ... }

 # blending existing properties
 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
 +utf8::IsLatin
 +utf8::IsGreek
 &utf8::IsTitle
 END_OF_SET

 if (/\p{Is_GraecoRoman_Title}/) { ... }

=head2 Convert non-ASCII Unicode numerics

Unless you’ve used C</a>, C<\d> matches more than ASCII digits.  

 use v5.14;  # needed for num() function
 use Unicode::UCD qw(num);
 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
 my @nums = ();
 while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
    push @nums, num($1);
 } 
 say "@nums";   #     12      4567      0.875         

 use charnames qw(:full);
 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=head2 Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>. 

 # Find vowel plus any combining diacritics, underlining, etc.
 use Unicode::Normalize;
 my $nfd = NFD($orig);
 $nfd =~ /(?=[aeiou])\X/i

=head2 Extract by grapheme instead of by codepoint (regex)

 # match and grab five first graphemes
 my($first_five) = $str =~ /^(\X{5})/;

=head2 Extract by grapheme instead of by codepoint (substr)

 # cpan -i Unicode::GCString
 use Unicode::GCString;
 my $gcs = Unicode::GCString->new($str);
 my $first_five = $gcs->substr(0, 5);

=head2 Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead.  Both these approaches work
right no matter what normalization the string is in:

 $str = join("", reverse $str =~ /\X/g);

 # OR: cpan -i Unicode::GCString
 use Unicode::GCString;
 $str = reverse Unicode::GCString->new($str);

=head2 String length in graphemes

Count by grapheme, not by codepoint.

 my $count = 0;
 while ($str =~ /\X/g) { $count++ }

 # OR: cpan -i Unicode::GCString
 use Unicode::GCString;
 my $gcs = Unicode::GCString->new($str);
 my $count = $gcs->length;

=head2 Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all 
codepoints take up 1 print column, but many take 0 or 2.

 # cpan -i Unicode::GCString
 use Unicode::GCString;
 my $gcs = Unicode::GCString->new($str);
 my $cols = $gcs->columns;
 printf "%*s\n", $cols, $str;

=head2 Unicode normalization

Typically render into NFD on input and NFC on output.
Using either of the NFK functions improves recall on searches.
Note that this is about much more than just pre-combined compatibility glyphs.

 use Unicode::Normalize;
 my $nfd  = NFD($orig);
 my $nfc  = NFC($orig);
 my $nfkd = NFKD($orig);
 my $nfkc = NFKC($orig);
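
As a sketch of why the NFK forms improve recall: compatibility
decomposition maps variants like the “ﬁ” ligature onto the plain letters,
so an NFKD-normalized search matches text an exact comparison misses.

```perl
use utf8;
use Unicode::Normalize;

my $hay    = "deﬁne";                      # contains LATIN SMALL LIGATURE FI
my $needle = "define";

my $exact = $hay eq $needle;               # false: ﬁ is not f+i
my $fuzzy = NFKD($hay) eq NFKD($needle);   # true: both become "define"
```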

=head2 Unicode collation

Text sorted by numeric codepoint follows no reasonable order;
use the UCA for sorting text.

 use Unicode::Collate;
 my $col = Unicode::Collate->new();
 my @list = $col->sort(@old_list);

=head2 Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.

 use Unicode::Collate;
 my $col = Unicode::Collate->new(level => 1);
 my @list = $col->sort(@old_list);

=head2 Unicode locale collation

Some locales have special sorting rules.

 # either use v5.12, OR: cpan -i Unicode::Collate::Locale
 use Unicode::Collate::Locale;
 my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
 my @list = $col->sort(@old_list);

=head2 Making C<cmp> work on text instead of codepoints

Instead of this:

 @srecs = sort {
     $b->{AGE}   <=>  $a->{AGE}
                 ||
     $a->{NAME}  cmp  $b->{NAME}
 } @recs;

Use this:

 my $coll = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

=head2 Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

 use Unicode::Collate;
 my $es = Unicode::Collate->new(
     level => 1,
     normalization => undef
 );

 # now both are true:
 $es->eq("García",  "GARCIA" );
 $es->eq("Márquez", "MARQUEZ");

=head2 Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

 my $de = Unicode::Collate::Locale->new(
            locale => "de__phonebook",
          );

 # now this is true:
 $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

=head2 Unicode linebreaking

Break up text into lines according to Unicode rules.

 # cpan -i Unicode::LineBreak 
 use Unicode::LineBreak;
 use charnames qw(:full);

 my $para = "This is a super\N{HYPHEN}long string. " x 20;
 my $fmt = Unicode::LineBreak->new;
 print $fmt->break($para), "\n";

=head2 Decode program arguments as utf8

     $ perl -CA ...
 or
     $ export PERL_UNICODE=A
 or
    use Encode qw(decode_utf8);
    @ARGV = map { decode_utf8($_, 1) } @ARGV;

=head2 Decode program arguments as locale encoding

    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;

=head2 Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

     $ perl -CS ...
 or
     $ export PERL_UNICODE=S
 or 
     use open qw(:std :utf8);
 or
     binmode(STDIN,  ":utf8");
     binmode(STDOUT, ":utf8");
     binmode(STDERR, ":utf8");

=head2 Declare STD{IN,OUT,ERR} to locale encoding

    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # or as a stream for binmode or open
    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

=head2 Make all I/O default to utf8

This includes files opened without an encoding argument.

     $ perl -CSD ...
 or
     $ export PERL_UNICODE=SD
 or 
     use open qw(:std :utf8);

=head2 Open file with encoding

Specify stream encoding.  This is the normal way
to deal with encoded text, not by calling low-level
functions.

 # input file
     open(my $in_file, "< :encoding(UTF-16)", "wintext");
 OR
     open(my $in_file, "<", "wintext");
     binmode($in_file, ":encoding(UTF-16)");
 THEN 
     my $line = <$in_file>;

 # output file
     open(my $out_file, "> :encoding(cp1252)", "wintext");
 OR
     open(my $out_file, ">", "wintext");
     binmode($out_file, ":encoding(cp1252)");
 THEN
     print $out_file "some text\n";

The incantation C<":raw :encoding(UTF-16LE) :crlf"> 
includes implicit CRLF handling.
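
A round-trip sketch of that incantation, with a temp file standing in for
a real Windows text file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layers stack from the OS outward: :raw clears any defaults,
# :encoding(UTF-16LE) maps bytes to characters, and :crlf folds
# \r\n into \n on input (and expands it back on output).
my ($out, $path) = tempfile();
binmode($out, ":raw:encoding(UTF-16LE):crlf");
print $out "hello\n";            # stored on disk as UTF-16LE "hello\r\n"
close($out);

open(my $in, "<:raw:encoding(UTF-16LE):crlf", $path)
    || die "can't open $path: $!";
my $line = <$in>;                # back to "hello\n"
close($in);
```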

=head2 Explicit encode/decode  

On very rare occasion, such as a database read, you may be
given encoded text you need to decode.

  use Encode qw(encode decode);

  my $chars = decode("shiftjis", $bytes);
 # OR
  my $bytes = encode("MIME-Header-ISO_2022_JP", $chars);

But see L<DBM_Filter::utf8> for easy implicit handling of UTF‑8 
in DBM databases.

=head2 Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM 
hash will trigger a wide character exception if any codepoints
won’t fit into a byte.  Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

 # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key);
    my $enc_value = encode("UTF-8", $uni_value);
    $dbhash{$enc_key} = $enc_value;

 # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value);

=head2 Unicode text in DBM hashes, the easy way

Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Value("utf8");  # this is the magic bit

 # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

 # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};

=head1 SEE ALSO

See these manpages, some of which are CPAN modules:
L<perlunicode>,
L<perluniprops>,
L<perlre>,
L<perlrecharclass>,
L<perluniintro>,
L<perlunitut>,
L<perlunifaq>,
L<PerlIO>,
L<DBM_Filter::utf8>,
L<Encode>,
L<Encode::Locale>,
L<Unicode::Normalize>,
L<Unicode::GCString>,
L<Unicode::LineBreak>,
L<Unicode::Collate>,
L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.

See also these portions of the Unicode Standard:

=over

=item UAX #44: Unicode Character Database

=item UTS #18: Unicode Regular Expressions

=item UAX #15: Unicode Normalization Forms

=item UTS #10: Unicode Collation Algorithm

=item UAX #29: Unicode Text Segmentation

=item UAX #14: Unicode Line Breaking Algorithm

=item UAX #11: East Asian Width

=back

=head1 AUTHOR

Tom Christiansen E<lt>tchrist@perl.comE<gt>

=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen. 
All rights reversed.  Use per Perl licence, blah blah.

Some code excerpts taken from the 4th Edition of I<Programming Perl>,
Copyright © 2012 blah blah 

Poetic licence, blah blah.

=head1 REVISION HISTORY

v0.3  -  added dbm, locales, and unihan

