Front page | perl.perl5.porters |
Postings from February 2012
RFC & PROPOSAL: add perlunicook.pod to std docset
From:
Tom Christiansen
Date:
February 24, 2012 19:35
Subject:
RFC & PROPOSAL: add perlunicook.pod to std docset
Message ID:
31258.1330140850@chthon
After mining Camel4 some more, I've added recipes for dbm, locales, and
unihan, putting me up to 42 recipes, which seems a good place to stop. Is
there any reason why this should *not* be included as part of the standard
Perl documentation? It's something of a FAQ, something of a cheat-sheet,
something of a cookbook. Here, take two and call me in the morning. 😷
Most of this is supposed to be about making easy stuff easy, although a bit is
admittedly about making hard stuff possible. But no matter how you look at
it, it's frightening how little of this can be done *at all*, let alone with
this amount of slickness, in any other programming language. Most of this
is just impossible anywhere else.
℞ 0. Standard preamble
℞ 1. Generic Unicode-savvy filter
℞ 2. Fine-tuning Unicode warnings
℞ 3. Declare source in utf8 for identifiers and literals
℞ 4. Characters and their numbers
℞ 5. Unicode literals by character number
℞ 6. Get character name by number
℞ 7. Get character number by name
℞ 8. Unicode named characters
℞ 9. Unicode named sequences
℞10. Custom named characters
℞11. Names of CJK codepoints
℞12. Unicode casing
℞13. Unicode case-insensitive comparisons
℞14. Match Unicode linebreak sequence in regex
℞15. Get character category
℞16. Disabling Unicode-awareness in builtin charclasses
℞17. Match Unicode properties in regex with \p, \P
℞18. Custom character properties
℞19. Convert non-ASCII Unicode numerics
℞20. Match Unicode grapheme cluster in regex
℞21. Extract by grapheme instead of by codepoint (regex)
℞22. Extract by grapheme instead of by codepoint (substr)
℞23. Reverse string by grapheme
℞24. String length in graphemes
℞25. Unicode column-width for printing
℞26. Unicode normalization
℞27. Unicode collation
℞28. Case- *and* accent-insensitive Unicode sort
℞29. Unicode locale collation
℞30. Making "cmp" work on text instead of codepoints
℞31. Case- *and* accent-insensitive comparisons
℞32. Case- *and* accent-insensitive locale comparisons
℞33. Unicode linebreaking
℞34. Decode program arguments as utf8
℞35. Decode program arguments as locale encoding
℞36. Declare STD{IN,OUT,ERR} to be utf8
℞37. Declare STD{IN,OUT,ERR} to locale encoding
℞38. Make all I/O default to utf8
℞39. Open file with encoding
℞40. Explicit encode/decode
℞41. Unicode text in DBM hashes, the tedious way
℞42. Unicode text in DBM hashes, the easy way
I'm not sure about the order: ℞34–39 seem like they should go a lot earlier.
Yes, it’s unabashedly heavy with Unicode, even brazen. But it has to be, or it
won't work. The local man system could stand some fine-tuning (we screw up
on UTF-8 manpages, you know!), since I don't think we've been quite so
brazen with this sort of thing before.
$ uniwc perlunicook.pod
Paras Lines Words Graphs Chars Bytes File
200 656 2202 16426 16428 16643 perlunicook.pod
$ pod2text perlunicook.pod | uniwc
Paras Lines Words Graphs Chars Bytes File
142 576 2142 17419 17421 17636 standard input
Yes, it's in NFC — that’s as few GCSes as I can squash it down to.
--tom
=encoding utf8
=head1 NAME
perlunicook - cookbookish examples of handling Unicode in Perl
=head1 DESCRIPTION
Unless otherwise noted, all examples below assume this standard
preamble, with the C<#!> adjusted to work on your system:
#!/usr/bin/env perl
use utf8;
use v5.12; # or later
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use charnames qw(:full :short); # unneeded in v5.16
This I<does> make you C<binmode> your binary streams, or open them
with C<:raw>, but that's the only way to get at them portably anyway.
B<WARNING>: C<use autodie> and C<use open> do not get along with each other.
=head1 EXAMPLES
=head2 Generic Unicode-savvy filter
Always decompose on the way in, then recompose on the way out.
use Unicode::Normalize;
while (<>) {
$_ = NFD($_);
...
} continue {
print NFC($_);
}
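As a concrete instance of the same shape (the strip-the-marks transform here is purely for illustration, not part of the recipe):

```perl
use utf8;
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

my $line = "crème brûlée\n";
$line = NFD($line);     # decompose into base characters + combining marks
$line =~ s/\pM//g;      # sample transform: strip all combining marks
$line = NFC($line);     # recompose on the way out
print $line;            # prints "creme brulee"
```

Because the transform runs on NFD text, it behaves the same no matter what normalization the input arrived in.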
=head2 Fine-tuning Unicode warnings
As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
use v5.14;
no warnings "nonchar"; # the 66 forbidden noncharacters
no warnings "surrogate"; # UTF-16/CESU-8 nonsense
no warnings "non_unicode"; # for codepoints over 0x10_FFFF
=head2 Declare source in utf8 for identifiers and literals
Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened, and you can
do things like this:
use utf8;
my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );
my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
=head2 Characters and their numbers
The C<ord> and C<chr> functions work transparently on all codepoints.
# ASCII characters
ord("A")
chr(65)
# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)
# beyond the BMP
ord("𝑛")
chr(0x1D45B)
# beyond Unicode! (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)
=head2 Unicode literals by character number
In a literal, you may specify a character by its number
using the C<\x{I<HHHHHH>}> escape.
String: "\x{3a3}"
Regex: /\x{3a3}/
String: "\x{1d45b}"
Regex: /\x{1d45b}/
# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/
=head2 Get character name by number
use charnames ();
my $name = charnames::viacode(0x03A3);
=head2 Get character number by name
use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
=head2 Unicode named characters
In v5.16, there is an implicit
use charnames qw(:full :short);
But prior to that release, you must be explicit about which charnames you
want. You should still specify a script if you want short names that are
script-specific.
use charnames qw(:full :short greek);
"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full
"\N{Greek:Sigma}" # :short
"\N{epsilon}" # greek
The v5.16 release also supports a C<:loose> import for loose matching of
character names.
=head2 Unicode named sequences
These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.
use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300
=head2 Custom named characters
Give your own nicknames to existing characters, or to unnamed
private-use characters.
use charnames ":full", ":alias" => {
ecute => "LATIN SMALL LETTER E WITH ACUTE",
"APPLE LOGO" => 0xF8FF, # private use character
};
"\N{ecute}"
"\N{APPLE LOGO}"
=head2 Names of CJK codepoints
Sinograms like “東京” come back with character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has a large database for decoding these, provided you know how
to understand its output.
# cpan -i Unicode::Unihan
use Unicode::Unihan;
my $str = "東京";
my $unhan = Unicode::Unihan->new;
for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
printf "CJK $str in %-12s is ", $lang;
say $unhan->$lang($str);
}
prints:
CJK 東京 in Mandarin is DONG1JING1
CJK 東京 in Cantonese is dung1ging1
CJK 東京 in Korean is TONGKYENG
CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
If you have a specific romanization scheme in mind,
use the specific module:
# cpan -i Lingua::JA::Romanize::Japanese
use Lingua::JA::Romanize::Japanese;
my $k2r = Lingua::JA::Romanize::Japanese->new;
my $str = "東京";
say "Japanese for $str is ", $k2r->chars($str);
prints
Japanese for 東京 is toukyou
=head2 Unicode casing
Unicode casing is very different from ASCII casing.
uc("henry ⅷ") # "HENRY Ⅷ"
uc("tschüß") # "TSCHÜSS" notice ß => SS
# both are true:
"tschüß" =~ /TSCHÜSS/i # notice ß => SS
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
=head2 Unicode case-insensitive comparisons
Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:
use feature "fc"; # fc() function is from v5.16
# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;
# both are true:
fc("tschüß") eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
=head2 Match Unicode linebreak sequence in regex
A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with intransigent Microsoft systems.
\R
s/\R/\n/g; # normalize all linebreaks to \n
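A quick check that C<\R> really treats CRLF as one unit (sample strings only):

```perl
use v5.10;   # \R requires v5.10
use strict;
use warnings;

my $msdos = "one\r\ntwo\r\nthree";
(my $unix = $msdos) =~ s/\R/\n/g;   # each CRLF collapses to a single \n
say $unix eq "one\ntwo\nthree" ? "normalized" : "mismatch";   # normalized
```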
=head2 Get character category
Find the general category of a numeric codepoint.
use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category}; # "Lu"
=head2 Disabling Unicode-awareness in builtin charclasses
Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode.
use v5.14;
use re "/a";
# OR
my($num) = $str =~ /(\d+)/a;
Or just use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{posix_digit}>. Properties still work normally
no matter what.
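To see what C</a> actually changes, a small sketch; the Devanagari digit is an arbitrary sample:

```perl
use utf8;
use v5.14;   # the /a modifier requires v5.14
use strict;
use warnings;

my $str = "4५6";   # ASCII 4, DEVANAGARI DIGIT FIVE, ASCII 6
my @all   = $str =~ /(\d)/g;    # Unicode-aware: matches all three
my @ascii = $str =~ /(\d)/ag;   # /a restricts \d to ASCII
say scalar(@all), " vs ", scalar(@ascii);   # 3 vs 2
```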
=head2 Match Unicode properties in regex with \p, \P
These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one lacking that property.
\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script=Latin}, \p{script=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
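These compose like any ordinary character class; for instance, counting codepoints by property (the sample string is arbitrary):

```perl
use utf8;
use strict;
use warnings;

my $str = "Ångström 123 Σίσυφος";
my $letters = () = $str =~ /\pL/g;        # letter codepoints
my $digits  = () = $str =~ /\pN/g;        # numeric codepoints
my $greek   = () = $str =~ /\p{Greek}/g;  # Greek-script codepoints
print "$letters $digits $greek\n";        # prints "15 3 7"
```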
=head2 Custom character properties
Define at compile-time your own custom character
properties for use in regexes.
# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }
if (/\p{In_Tengwar}/) { ... }
# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET
if (/\p{Is_GraecoRoman_Title}/) { ... }
=head2 Convert non-ASCII Unicode numerics
Unless you’ve used C</a>, C<\d> matches more than ASCII digits.
use v5.14; # needed for num() function
use Unicode::UCD qw(num);
my $str = "got Ⅻ and ४५६७ and ⅞ and here";
my @nums = ();
while ($str =~ /(\d+|\N)/g) { # not just ASCII!
push @nums, num($1);
}
say "@nums"; # 12 4567 0.875
use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
=head2 Match Unicode grapheme cluster in regex
Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.
# Find vowel plus any diacritics, underlining, etc.
my $nfd = NFD($orig);
$nfd =~ /(?=[aeiou])\X/i
=head2 Extract by grapheme instead of by codepoint (regex)
# match and grab five first graphemes
my($first_five) = $str =~ /^(\X{5})/;
=head2 Extract by grapheme instead of by codepoint (substr)
# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $first_five = $gcs->substr(0, 5);
=head2 Reverse string by grapheme
Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead. Both these approaches work
right no matter what normalization the string is in:
$str = join("", reverse $str =~ /\X/g);
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);
=head2 String length in graphemes
Count by grapheme, not by codepoint.
my $count = 0;
while ($str =~ /\X/g) { $count++ }
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $count = $gcs->length;
=head2 Unicode column-width for printing
Perl’s C<printf>, C<sprintf>, and C<format> think all
codepoints take up 1 print column, but many take 0 or 2.
# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $cols = $gcs->columns;
printf "%*s\n", $cols, $str;
=head2 Unicode normalization
Typically render into NFD on input and NFC on output.
Using either of the NFK functions improves recall on searches.
Note that this is about much more than just pre-combined compatibility glyphs.
use Unicode::Normalize;
my $nfd = NFD($orig);
my $nfc = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);
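The recall point is easiest to see with the compatibility (NFK*) forms, which fold presentation characters such as ligatures and superscripts down to plain text:

```perl
use utf8;
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# U+FB01 LATIN SMALL LIGATURE FI and U+00B2 SUPERSCRIPT TWO
print NFKD("ﬁn²"), "\n";   # prints "fin2", findable by a plain-text search
```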
=head2 Unicode collation
Text sorted by numeric codepoint follows no reasonable order;
use the UCA for sorting text.
use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);
=head2 Case- I<and> accent-insensitive Unicode sort
Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.
use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);
=head2 Unicode locale collation
Some locales have special sorting rules.
# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);
=head2 Making C<cmp> work on text instead of codepoints
Instead of this:
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME} cmp $b->{NAME}
} @recs;
Use this:
my $coll = Unicode::Collate->new();
for my $rec (@recs) {
$rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
}
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME_key} cmp $b->{NAME_key}
} @recs;
=head2 Case- I<and> accent-insensitive comparisons
Use a collator object to compare Unicode text by character
instead of by codepoint.
use Unicode::Collate;
my $es = Unicode::Collate->new(
level => 1,
normalization => undef
);
# now both are true:
$es->eq("García", "GARCIA" );
$es->eq("Márquez", "MARQUEZ");
=head2 Case- I<and> accent-insensitive locale comparisons
Same, but in a specific locale.
my $de = Unicode::Collate::Locale->new(
locale => "de__phonebook",
);
# now this is true:
$de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
=head2 Unicode linebreaking
Break up text into lines according to Unicode rules.
# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);
my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";
=head2 Decode program arguments as utf8
$ perl -CA ...
or
$ export PERL_UNICODE=A
or
use Encode qw(decode_utf8);
@ARGV = map { decode_utf8($_, 1) } @ARGV;
=head2 Decode program arguments as locale encoding
# cpan -i Encode::Locale
use Encode qw(decode);
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_, 1) } @ARGV;
=head2 Declare STD{IN,OUT,ERR} to be utf8
Use a command-line option, an environment variable, or else
call C<binmode> explicitly:
$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use open qw(:std :utf8);
or
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
=head2 Declare STD{IN,OUT,ERR} to locale encoding
# cpan -i Encode::Locale
use Encode;
use Encode::Locale;
# or as a stream for binmode or open
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
=head2 Make all I/O default to utf8
This includes files opened without an encoding argument.
$ perl -CSD ...
or
$ export PERL_UNICODE=SD
or
use open qw(:std :utf8);
=head2 Open file with encoding
Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.
# input file
open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
open(my $in_file, "<", "wintext");
binmode($in_file, ":encoding(UTF-16)");
THEN
my $line = <$in_file>;
# output file
open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
open(my $out_file, ">", "wintext");
binmode($out_file, ":encoding(cp1252)");
THEN
print $out_file "some text\n";
The incantation C<":raw :encoding(UTF-16LE) :crlf">
includes implicit CRLF handling.
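A round-trip sketch of that incantation; the scratch filename is made up:

```perl
use utf8;
use strict;
use warnings;

my $file = "wintext-demo.txt";   # hypothetical scratch file

# Write a UTF-16LE file with CRLF line endings, as a Windows tool would:
open(my $out, "> :raw :encoding(UTF-16LE) :crlf", $file) or die "open: $!";
print $out "Ångström\n";         # the \n goes to disk as CRLF
close($out);

# Read it back: :raw clears the default layers, then decode, then CRLF -> \n
open(my $in, "< :raw :encoding(UTF-16LE) :crlf", $file) or die "open: $!";
my $line = <$in>;
close($in);
unlink $file;

print $line eq "Ångström\n" ? "round trip ok\n" : "mismatch\n";
```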
=head2 Explicit encode/decode
On very rare occasion, such as a database read, you may be
given encoded text you need to decode.
use Encode qw(encode decode);
my $chars = decode("shiftjis", $bytes);
# OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars);
But see L<DBM_Filter::utf8> for easy implicit handling of UTF‑8
in DBM databases.
=head2 Unicode text in DBM hashes, the tedious way
Using a regular Perl string as a key or value for a DBM
hash will trigger a wide character exception if any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:
use DB_File;
use Encode qw(encode decode);
tie %dbhash, "DB_File", "pathname";
# STORE
# assume $uni_key and $uni_value are abstract Unicode strings
my $enc_key = encode("UTF-8", $uni_key);
my $enc_value = encode("UTF-8", $uni_value);
$dbhash{$enc_key} = $enc_value;
# FETCH
# assume $uni_key holds a normal Perl string (abstract Unicode)
my $enc_key = encode("UTF-8", $uni_key);
my $enc_value = $dbhash{$enc_key};
my $uni_value = decode("UTF-8", $enc_value);
=head2 Unicode text in DBM hashes, the easy way
Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:
use DB_File;
use DBM_Filter;
my $dbobj = tie %dbhash, "DB_File", "pathname";
$dbobj->Filter_Value("utf8"); # this is the magic bit
# STORE
# assume $uni_key and $uni_value are abstract Unicode strings
$dbhash{$uni_key} = $uni_value;
# FETCH
# $uni_key holds a normal Perl string (abstract Unicode)
my $uni_value = $dbhash{$uni_key};
=head1 SEE ALSO
See these manpages, some of which are CPAN modules:
L<perlunicode>,
L<perluniprops>,
L<perlre>,
L<perlrecharclass>,
L<perluniintro>,
L<perlunitut>,
L<perlunifaq>,
L<PerlIO>,
L<DBM_Filter::utf8>,
L<Encode>,
L<Encode::Locale>,
L<Unicode::Normalize>,
L<Unicode::GCString>,
L<Unicode::LineBreak>,
L<Unicode::Collate>,
L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.
See also these portions of the Unicode Standard:
=over
=item UAX #44: Unicode Character Database
=item UTS #18: Unicode Regular Expressions
=item UAX #15: Unicode Normalization Forms
=item UTS #10: Unicode Collation Algorithm
=item UAX #29: Unicode Text Segmentation
=item UAX #14: Unicode Line Breaking Algorithm
=item UAX #11: East Asian Width
=back
=head1 AUTHOR
Tom Christiansen E<lt>tchrist@perl.comE<gt>
=head1 COPYRIGHT AND LICENCE
Copyright © 2012 Tom Christiansen.
All rights reversed. Use per Perl licence, blah blah.
Some code excerpts taken from the 4th Edition of I<Programming Perl>,
Copyright © 2012 blah blah
Poetic licence, blah blah.
=head1 REVISION HISTORY
v0.3 - added dbm, locales, and unihan