On the Good Ship Unicode (was: [perl #57800] \p{Letter} not matching unicode input)
From: Tom Christiansen
Date: August 11, 2008 22:33
Subject: On the Good Ship Unicode (was: [perl #57800] \p{Letter} not matching unicode input)
Message ID: 6653.1218519179@chthon
In-Reply-To: Message from Mark Blackman (via RT)
<perlbug-followup@perl.org> of "Mon, 11 Aug 2008 08:13:58 PDT."
<rt-3.6.HEAD-29759-1218467638-1300.57800-75-0@perl.org>
> /^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)
> perl -ne 'chomp; if (/^[\p{Letter}]+$/) { print "letter-->",$_,"\n"; }'
> and with a utf8 terminal, enter a cedilla C. U+0037
> if you drop the final end-of-string match, the match succeeds
> with a [single] cedilla C.
That's a mildly odd pattern. I wonder why you're embracketing a single
property like that? These aren't icky POSIX character classes; you don't
need the brackets.
It all depends on how you enter things. The precomposed Unicode character
LATIN CAPITAL LETTER C WITH CEDILLA is at code point 0xC7, while the LATIN
SMALL LETTER C WITH CEDILLA is at 0xE7. I guess if you flip your 3 the
other way it looks like an E, but ....
The problem is surely how you're entering data (code point such and such
says nothing about the physical representation of the logical number),
something you did *not* specify, which makes it harder to know for sure.
But your envariables don't look culpable.
You really should still consult the perlrun manpage:

    "-C" on its own (not followed by any number or option
    list), or the empty string "" for the "PERL_UNICODE"
    environment variable, has the same effect as "-CSDL".
    In other words, the standard I/O handles and the
    default "open()" layer are UTF-8-fied but only if the
    locale environment variables indicate a UTF-8 locale.
    This behaviour follows the implicit (and problematic)
    UTF-8 behaviour of Perl 5.8.0.

So at the initial 5.8.0 release of perl, we went through a time when
things were, um, a little too quick to jump at your envariables, and
many a train-wreck ensued. You *probably* don't want that, any more
than you want the L flag, which is documented thus:

    L     64   normally the "IOEioA" are unconditional, the L makes
               them conditional on the locale environment variables
               (the LC_ALL, LC_TYPE, and LANG, in the order of decreasing
               precedence) -- if the variables indicate UTF-8, then the
               selected "IOEioA" are in effect

For your education, edification, and indeed, even amusement, I offer up the
following program that explicitly sets things (read: streams; I/O layers)
in a Unicodey way--something I can't see that you did--and only then goes
about sniffing around to decide whether a string is "letterishlike".
See, even then, you're still going to need to be just a *wee* bit more
snerpickety in your inspection. But fear not, for the key to getting
this final part right is provided in a comment by itself right at the
very top of the demo program below.
*DO* please enjoy! I sure know *I* did. You may lay the blame
for this "artistic" coding whim on my recent (ahem) reading material,
which either you already know (of)--or else, surely don't care to. (:->
--tom
#!/bin/sh
# gcb - judge letterishness, plus count
# Graphemes, Characters, and Bytes
#
# Tom Christiansen <tchrist@perl.com>
# Mon Aug 11 23:09:44 MDT 2008
#======================----------->vvvvvvvvvvvvvvvvvvvv<---#
# The *KEY* to it all is simply =~ /\A(?:(?=\pL)\X)+\z/ #
#=====================----------->#^^^^^^^^^^^^^^^^^^^^<---#
#############################################################################
# Embedded cryptojest notwithstanding, the well-commented, ASCII-art demo #
# program [ it's something of a ship if you turn your monitor sideways or #
# run it through my rot90 filter :-] I enclose below can be expected to #
# produce the following clear and illuminating Unicode output: #
#############################################################################
# 1: G⁼1 C⁼ 1 B⁼ 1 M has but \pL in U+004d
# 2: G⁼1 C⁼ 1 B⁼ 2 Μ has but \pL in U+039c
# 3: G⁼1 C⁼ 1 B⁼ 2 µ has but \pL in U+00b5
# 4: G⁼1 C⁼ 1 B⁼ 2 μ has but \pL in U+03bc
# 5: G⁼1 C⁼ 1 B⁼ 1 C has but \pL in U+0043
# 6: G⁼1 C⁼ 1 B⁼ 2 Ç has but \pL in U+00c7
# 7: G⁼1 C⁼ 2 B⁼ 3 Ç has but \pL in U+0043.0327
# 8: G⁼1 C⁼ 3 B⁼ 5 Ç̌ has but \pL in U+0043.0327.030c
# 9: G⁼1 C⁼ 2 B⁼ 4 Ç̌ has but \pL in U+00c7.030c
# 10: G⁼1 C⁼ 3 B⁼ 5 Ç̌ has but \pL in U+0043.030c.0327
# 11: G⁼1 C⁼ 2 B⁼ 5 ℯ̧ has but \pL in U+212f.0327
# 12: G⁼1 C⁼ 1 B⁼ 3 ℯ has but \pL in U+212f
# 13: G⁼1 C⁼ 1 B⁼ 3 ℛ has but \pL in U+211b
# 14: G⁼1 C⁼ 2 B⁼ 6 ℛ⃠ has but \pL in U+211b.20e0
# 15: G⁼1 C⁼ 1 B⁼ 2 Π has but \pL in U+03a0
# 16: G⁼1 C⁼ 2 B⁼ 5 ψ⃗ has but \pL in U+03c8.20d7
# 17: G⁼1 C⁼ 1 B⁼ 1 ? LACKS \pL in U+003f
# 18: G⁼1 C⁼ 1 B⁼ 2 ʔ has but \pL in U+0294
# 19: G⁼1 C⁼ 2 B⁼ 4 ʔ̴ has but \pL in U+0294.0334
# 20: G⁼1 C⁼ 1 B⁼ 2 ¿ LACKS \pL in U+00bf
# 21: G⁼2 C⁼ 2 B⁼ 5 πℯ has but \pL in U+03c0.212f
# 22: G⁼2 C⁼ 3 B⁼ 7 Φℯ̄ has but \pL in U+03a6.212f.0304
# 23: G⁼2 C⁼ 2 B⁼ 3 ¿? LACKS \pL in U+00bf.003f
# 24: G⁼2 C⁼ 2 B⁼ 4 ʕʖ has but \pL in U+0295.0296
# 25: G⁼2 C⁼ 2 B⁼ 4 ʕʔ has but \pL in U+0295.0294
# 26: G⁼2 C⁼ 2 B⁼ 5 Ⅱª LACKS \pL in U+2161.00aa
# 27: G⁼3 C⁼ 3 B⁼ 4 IIª has but \pL in U+0049.0049.00aa
# 28: G⁼4 C⁼ 4 B⁼10 Ψℯ⁻¹ LACKS \pL in U+03a8.212f.207b.00b9
# 29: G⁼4 C⁼ 4 B⁼ 5 Cómo has but \pL in U+0043.00f3.006d.006f
# 30: G⁼4 C⁼ 5 B⁼ 6 Cómo has but \pL in U+0043.006f.0301.006d.006f
# 31: G⁼6 C⁼ 7 B⁼ 9 ¿Cómo? LACKS \pL in U+00bf.0043.006f.0301.006d.006f.003f
# 32: G⁼6 C⁼ 7 B⁼10 ʖCómoʔ has but \pL in U+0296.0043.006f.0301.006d.006f.0294
# 33: G⁼6 C⁼14 B⁼24 ʖ̲C̲ó̲̲m̲o̲ʔ̲ has but \pL in U+0296.0332.0043.0332.006f.0332.0301.0332.006d.0332.006f.0332.0294.0332
# 34: G⁼6 C⁼ 6 B⁼ 9 wrǽþþu has but \pL in U+0077.0072.01fd.00fe.00fe.0075
# 35: G⁼6 C⁼ 6 B⁼ 9 WRǼÞÞU has but \pL in U+0057.0052.01fc.00de.00de.0055
# 36: G⁼6 C⁼ 7 B⁼11 wrǽþþu has but \pL in U+0077.0072.00e6.0301.00fe.00fe.0075
# 37: G⁼6 C⁼ 7 B⁼11 WRǼÞÞU has but \pL in U+0057.0052.00c6.0301.00de.00de.0055
# 38: G⁼7 C⁼ 7 B⁼ 8 laȝamon has but \pL in U+006c.0061.021d.0061.006d.006f.006e
# 39: G⁼7 C⁼ 7 B⁼ 8 LAȜAMON has but \pL in U+004c.0041.021c.0041.004d.004f.004e
# 40: G⁼6 C⁼ 6 B⁼ 8 tschüß has but \pL in U+0074.0073.0063.0068.00fc.00df
# 41: G⁼6 C⁼ 7 B⁼ 9 tschüß has but \pL in U+0074.0073.0063.0068.0075.0308.00df
# 42: G⁼6 C⁼ 6 B⁼ 8 ßühcsT has but \pL in U+00df.00fc.0068.0063.0073.0054
# 43: G⁼7 C⁼ 7 B⁼ 8 TSCHÜSS has but \pL in U+0054.0053.0043.0048.00dc.0053.0053
# 44: G⁼7 C⁼ 8 B⁼ 9 TSCHÜSS has but \pL in U+0054.0053.0043.0048.0055.0308.0053.0053
# 45: G⁼7 C⁼ 8 B⁼ 9 Ss̈uhcst has but \pL in U+0053.0073.0308.0075.0068.0063.0073.0074
# 46: G⁼7 C⁼ 8 B⁼ 9 Ssühcst has but \pL in U+0053.0073.0075.0308.0068.0063.0073.0074
# 47: G⁼8 C⁼ 8 B⁼10 coŀleció has but \pL in U+0063.006f.0140.006c.0065.0063.0069.00f3
# 48: G⁼8 C⁼ 8 B⁼10 ÓiceĿloC has but \pL in U+00d3.0069.0063.0065.013f.006c.006f.0043
# 49: G⁼8 C⁼ 9 B⁼11 coŀleció has but \pL in U+0063.006f.0140.006c.0065.0063.0069.006f.0301
# 50: G⁼8 C⁼ 9 B⁼11 COĿLECIÓ has but \pL in U+0043.004f.013f.004c.0045.0043.0049.004f.0301
# 51: G⁼9 C⁼10 B⁼12 col·leció LACKS \pL in U+0063.006f.006c.00b7.006c.0065.0063.0069.006f.0301
# 52: G⁼9 C⁼10 B⁼12 COL·LECIÓ LACKS \pL in U+0043.004f.004c.00b7.004c.0045.0043.0049.004f.0301
perl -CS -Mcharnames=:full,:short -le 'print for(
"M","\N{Greek:Mu}","\x{B5}","\N{Greek:mu}","C",
"\N{Latin:C WITH CEDILLA}","C\N{COMBINING CEDILLA}",
"C\N{COMBINING CEDILLA}\N{COMBINING CARON}",
"\N{Latin:C WITH CEDILLA}\N{COMBINING CARON}",
"C\N{COMBINING CARON}\N{COMBINING CEDILLA}",
"\N{SCRIPT SMALL E}\N{COMBINING CEDILLA}",
"\N{SCRIPT SMALL E}","\N{SCRIPT CAPITAL R}",
"\N{SCRIPT CAPITAL R}\N{COMBINING ENCLOSING CIRCLE BACKSLASH}",
"\N{Greek:Pi}","\N{Greek:psi}\N{COMBINING RIGHT ARROW ABOVE}","?",
"\N{LATIN LETTER GLOTTAL STOP}","\N{LATIN LETTER GLOTTAL STOP}".
"\N{COMBINING TILDE OVERLAY}","\x{bf}","\N{Greek:pi}\N{SCRIPT SMALL E}",
"\N{Greek:Phi}\N{SCRIPT SMALL E}\N{COMBINING MACRON}","\x{bf}?",
"\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
"\N{LATIN LETTER INVERTED GLOTTAL STOP}",
"\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
"\N{LATIN LETTER GLOTTAL STOP}","\x{2161}\x{aa}","II\x{aa}",
"\N{Greek:Psi}\N{SCRIPT SMALL E}\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}",
"C\N{Latin:o with acute}mo","Co\N{COMBINING ACUTE ACCENT}mo",
"\x{bf}Co\N{COMBINING ACUTE ACCENT}mo?",
"\N{LATIN LETTER INVERTED GLOTTAL STOP}Co\N{COMBINING ACUTE ACCENT}".
"mo\N{LATIN LETTER GLOTTAL STOP}",
"\N{LATIN LETTER INVERTED GLOTTAL STOP}\x{332}C\x{332}o\x{332}".
"\N{COMBINING ACUTE ACCENT}\x{332}m\x{332}o\x{332}".
"\N{LATIN LETTER GLOTTAL STOP}\x{332}",
"wr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
"\Uwr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
"wr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
"\Uwr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
"la\N{Latin:yogh}amon",uc"La\N{Latin:yogh}amon",
"tsch\N{Latin:u with diaeresis}\x{df}",
"tschu\N{COMBINING DIAERESIS}\x{df}",scalar reverse(
"\utsch\N{Latin:u with diaeresis}\x{df}"),
"\Utsch\N{Latin:u with diaeresis}\x{df}",
"\Utschu\N{COMBINING DIAERESIS}\x{df}",ucfirst scalar reverse(
"tschu\N{COMBINING DIAERESIS}\x{df}"),ucfirst reverse(scalar reverse reverse
"tschu\N{COMBINING DIAERESIS}\x{df}"=~/(?#YANETUT)\X/g),
"co\x{140}leci\N{Latin:o with acute}",ucfirst reverse(
"\ucol\u\x{140}eci\N{Latin:o with acute}"),
"co\x{140}lecio\N{COMBINING ACUTE ACCENT}",
"\Uco\x{140}lecio\N{COMBINING ACUTE ACCENT}",
"col\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
"\Ucol\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
)'|perl -CS -mbytes -nle '(($m)=m=\A(\X+)\z=)||die;
printf"%2d: G\x{207C}%d C\x{207C}%2d B\x{207C}%2d".
"\t%s\t%s\t\\pL in U+%v04x \n",++$i,scalar(()=
$m=~m~\X~g),(length$m,bytes::length$m,$m),$m=~m
"\A(?:(?=\pL)\X)+\z"?"has but":"LACKS",$m,
;';