
On the Good Ship Unicode (was: [perl #57800] \p{Letter} not matching unicode input)

From:
Tom Christiansen
Date:
August 11, 2008 22:33
Subject:
On the Good Ship Unicode (was: [perl #57800] \p{Letter} not matching unicode input)
Message ID:
6653.1218519179@chthon
In-Reply-To: Message from Mark Blackman (via RT)
    <perlbug-followup@perl.org> of "Mon, 11 Aug 2008 08:13:58 PDT."
   <rt-3.6.HEAD-29759-1218467638-1300.57800-75-0@perl.org>

> /^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)

> perl -ne 'chomp; if (/^[\p{Letter}]+$/) { print "letter-->",$_,"\n"; }'

> and with a utf8 terminal, enter a cedilla C. U+0037

> if you drop the final end-of-string match, the match succeeds
> with a [single] cedilla C.

That's a mildly odd pattern.  I wonder why you're embracketing a single
property like that?  These aren't icky POSIX character classes; you don't
need the brackets.

It all depends on how you enter things.  The precomposed Unicode character
LATIN CAPITAL LETTER C WITH CEDILLA is at code point 0xC7, while the LATIN
SMALL LETTER C WITH CEDILLA is at 0xE7.  I guess if you flip your 3 the
other way it looks like an E, but ....

The problem is surely how you're entering data (code point such and such
says nothing about the physical representation of the logical number),
something you did *not* specify, which makes it harder to know for sure.
But your envariables don't look culpable.
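
To make the bytes-versus-characters point concrete, here is a minimal
sketch.  It assumes the terminal sent the two UTF-8 octets for U+00C7 and
that nothing decoded them before the match; the helper names are made up
for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two octets a UTF-8 terminal sends for U+00C7, not yet decoded.
my $octets = "\xC3\x87";

# The same input after decoding: one character, code point 0xC7.
my $chars = decode('UTF-8', $octets);

# Hypothetical helpers: the reporter's anchored pattern, and the
# same pattern with the end-of-string anchor dropped.
sub all_letters   { return $_[0] =~ /^\p{Letter}+$/ ? 1 : 0 }
sub starts_letter { return $_[0] =~ /^\p{Letter}+/  ? 1 : 0 }

printf "octets: length=%d anchored=%d unanchored=%d\n",
    length $octets, all_letters($octets), starts_letter($octets);
printf "chars:  length=%d anchored=%d unanchored=%d\n",
    length $chars, all_letters($chars), starts_letter($chars);
```

Undecoded, the string is two "characters", 0xC3 and 0x87.  The first is a
letter and the second is a control character, which would explain why the
anchored match fails while the unanchored one succeeds.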

You really should still consult the perlrun manpage:

    "-C" on its own (not followed by any number or option
    list), or the empty string "" for the "PERL_UNICODE"
    environment variable, has the same effect as "-CSDL".
    In other words, the standard I/O handles and the
    default "open()" layer are UTF-8-fied but only if the
    locale environment variables indicate a UTF-8 locale.
    This behaviour follows the implicit (and problematic)
    UTF-8 behaviour of Perl 5.8.0.

So at the initial 5.8.0 release of perl, we went through a time when
things were, um, a little too quick to jump at your envariables, and
many a train-wreck ensued.  You *probably* don't want that, any more
than you want the L flag, which is documented thus:

    L    64   normally the "IOEioA" are unconditional, the L makes
              them conditional on the locale environment variables 
              (the LC_ALL, LC_TYPE, and LANG, in the order of decreasing
              precedence) -- if the variables indicate UTF-8, then the
              selected "IOEioA" are in effect
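
A program can also set its layers itself rather than lean on -C and the
locale.  The following is only a sketch: an in-memory file stands in for
the terminal so that it is self-contained, and the choice of
":encoding(UTF-8)" over the laxer ":utf8" is an assumption of this
example, not something the manpage excerpt prescribes.

```perl
use strict;
use warnings;

# In a real program one would write, near the top:
#   binmode(STDIN,  ':encoding(UTF-8)');
#   binmode(STDOUT, ':encoding(UTF-8)');
# Here an in-memory file plays the part of STDIN.
my $raw = "\xC3\x87\n";    # octets for U+00C7, then a newline

open(my $in, '<:encoding(UTF-8)', \$raw) or die "open: $!";
my $line = <$in>;          # decoded on the way in by the layer
close $in;
chomp $line;

printf "length=%d ord=U+%04X letter=%s\n",
    length $line, ord $line,
    ($line =~ /^\p{Letter}+$/ ? 'yes' : 'no');
```

With the decoding layer in place the line arrives as a single character,
and the reporter's original anchored pattern matches it.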

For your education, edification, and indeed, even amusement, I offer up the
following program that explicitly sets things (read: streams; I/O layers)
in a Unicodey way--something I can't see that you did--and only then goes
about sniffing around to decide whether a string is "letterishlike".

See, even then, you're still going to need to be just a *wee* bit more
snerpickety in your inspection.  But fear not, for the key to getting 
this final part right is provided in a comment by itself right at the 
very top of the demo program below.
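
In case the pattern in that comment looks opaque: \X consumes one whole
grapheme (a base character plus any combining marks), and the (?=\pL)
lookahead insists that each grapheme begin with a letter.  A small sketch
comparing it with a plain \pL+ follows; the sub names are invented here.

```perl
use strict;
use warnings;

# The pattern from the comment at the top of the demo program.
sub letterish { return $_[0] =~ /\A(?:(?=\pL)\X)+\z/ ? 1 : 0 }

# A plain run of letters, for comparison.
sub naive { return $_[0] =~ /\A\pL+\z/ ? 1 : 0 }

my @samples = (
    [ 'precomposed U+00C7',    "\x{C7}"   ],
    [ 'C + combining cedilla', "C\x{327}" ],
    [ 'question mark',         "?"        ],
);

for my $s (@samples) {
    printf "%-22s naive=%d letterish=%d\n",
        $s->[0], naive($s->[1]), letterish($s->[1]);
}
```

The naive pattern rejects the decomposed form, because COMBINING CEDILLA
is a mark rather than a letter; the lookahead version accepts it.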

*DO* please enjoy!  I sure know *I* did.  You may lay the blame
for this "artistic" coding whim on my recent (ahem) reading material,
which either you already know (of)--or else surely don't care to. (:->

--tom

#!/bin/sh

# gcb - judge letterishness, plus count 
#       Graphemes, Characters, and Bytes
#
# Tom Christiansen <tchrist@perl.com>
# Mon Aug 11 23:09:44 MDT 2008

    #======================----------->vvvvvvvvvvvvvvvvvvvv<---#
    # The *KEY* to it all is simply =~ /\A(?:(?=\pL)\X)+\z/    #
    #=====================----------->#^^^^^^^^^^^^^^^^^^^^<---#

#############################################################################
# Embedded cryptojest notwithstanding, the well-commented, ASCII-art demo   #
# program [ it's something of a ship if you turn your monitor sideways or   #
# run it through my rot90 filter :-] I enclose below can be expected to     #
# produce the following clear and illuminating Unicode output:              #
#############################################################################


#  1: G⁼1 C⁼ 1 B⁼ 1	M	has but	\pL in U+004d  
#  2: G⁼1 C⁼ 1 B⁼ 2	Μ	has but	\pL in U+039c  
#  3: G⁼1 C⁼ 1 B⁼ 2	µ	has but	\pL in U+00b5  
#  4: G⁼1 C⁼ 1 B⁼ 2	μ	has but	\pL in U+03bc  
#  5: G⁼1 C⁼ 1 B⁼ 1	C	has but	\pL in U+0043  
#  6: G⁼1 C⁼ 1 B⁼ 2	Ç	has but	\pL in U+00c7  
#  7: G⁼1 C⁼ 2 B⁼ 3	Ç	has but	\pL in U+0043.0327  
#  8: G⁼1 C⁼ 3 B⁼ 5	Ç̌	has but	\pL in U+0043.0327.030c  
#  9: G⁼1 C⁼ 2 B⁼ 4	Ç̌	has but	\pL in U+00c7.030c  
# 10: G⁼1 C⁼ 3 B⁼ 5	Ç̌	has but	\pL in U+0043.030c.0327  
# 11: G⁼1 C⁼ 2 B⁼ 5	ℯ̧	has but	\pL in U+212f.0327  
# 12: G⁼1 C⁼ 1 B⁼ 3	ℯ	has but	\pL in U+212f  
# 13: G⁼1 C⁼ 1 B⁼ 3	ℛ	has but	\pL in U+211b  
# 14: G⁼1 C⁼ 2 B⁼ 6	ℛ⃠	has but	\pL in U+211b.20e0  
# 15: G⁼1 C⁼ 1 B⁼ 2	Π	has but	\pL in U+03a0  
# 16: G⁼1 C⁼ 2 B⁼ 5	ψ⃗	has but	\pL in U+03c8.20d7  
# 17: G⁼1 C⁼ 1 B⁼ 1	?	LACKS	\pL in U+003f  
# 18: G⁼1 C⁼ 1 B⁼ 2	ʔ	has but	\pL in U+0294  
# 19: G⁼1 C⁼ 2 B⁼ 4	ʔ̴	has but	\pL in U+0294.0334  
# 20: G⁼1 C⁼ 1 B⁼ 2	¿	LACKS	\pL in U+00bf  
# 21: G⁼2 C⁼ 2 B⁼ 5	πℯ	has but	\pL in U+03c0.212f  
# 22: G⁼2 C⁼ 3 B⁼ 7	Φℯ̄	has but	\pL in U+03a6.212f.0304  
# 23: G⁼2 C⁼ 2 B⁼ 3	¿?	LACKS	\pL in U+00bf.003f  
# 24: G⁼2 C⁼ 2 B⁼ 4	ʕʖ	has but	\pL in U+0295.0296  
# 25: G⁼2 C⁼ 2 B⁼ 4	ʕʔ	has but	\pL in U+0295.0294  
# 26: G⁼2 C⁼ 2 B⁼ 5	Ⅱª	LACKS	\pL in U+2161.00aa  
# 27: G⁼3 C⁼ 3 B⁼ 4	IIª	has but	\pL in U+0049.0049.00aa  
# 28: G⁼4 C⁼ 4 B⁼10	Ψℯ⁻¹	LACKS	\pL in U+03a8.212f.207b.00b9  
# 29: G⁼4 C⁼ 4 B⁼ 5	Cómo	has but	\pL in U+0043.00f3.006d.006f  
# 30: G⁼4 C⁼ 5 B⁼ 6	Cómo	has but	\pL in U+0043.006f.0301.006d.006f  
# 31: G⁼6 C⁼ 7 B⁼ 9	¿Cómo?	LACKS	\pL in U+00bf.0043.006f.0301.006d.006f.003f  
# 32: G⁼6 C⁼ 7 B⁼10	ʖCómoʔ	has but	\pL in U+0296.0043.006f.0301.006d.006f.0294  
# 33: G⁼6 C⁼14 B⁼24	ʖ̲C̲ó̲̲m̲o̲ʔ̲	has but	\pL in U+0296.0332.0043.0332.006f.0332.0301.0332.006d.0332.006f.0332.0294.0332  
# 34: G⁼6 C⁼ 6 B⁼ 9	wrǽþþu	has but	\pL in U+0077.0072.01fd.00fe.00fe.0075  
# 35: G⁼6 C⁼ 6 B⁼ 9	WRǼÞÞU	has but	\pL in U+0057.0052.01fc.00de.00de.0055  
# 36: G⁼6 C⁼ 7 B⁼11	wrǽþþu	has but	\pL in U+0077.0072.00e6.0301.00fe.00fe.0075  
# 37: G⁼6 C⁼ 7 B⁼11	WRǼÞÞU	has but	\pL in U+0057.0052.00c6.0301.00de.00de.0055  
# 38: G⁼7 C⁼ 7 B⁼ 8	laȝamon	has but	\pL in U+006c.0061.021d.0061.006d.006f.006e  
# 39: G⁼7 C⁼ 7 B⁼ 8	LAȜAMON	has but	\pL in U+004c.0041.021c.0041.004d.004f.004e  
# 40: G⁼6 C⁼ 6 B⁼ 8	tschüß	has but	\pL in U+0074.0073.0063.0068.00fc.00df  
# 41: G⁼6 C⁼ 7 B⁼ 9	tschüß	has but	\pL in U+0074.0073.0063.0068.0075.0308.00df  
# 42: G⁼6 C⁼ 6 B⁼ 8	ßühcsT	has but	\pL in U+00df.00fc.0068.0063.0073.0054  
# 43: G⁼7 C⁼ 7 B⁼ 8	TSCHÜSS	has but	\pL in U+0054.0053.0043.0048.00dc.0053.0053  
# 44: G⁼7 C⁼ 8 B⁼ 9	TSCHÜSS	has but	\pL in U+0054.0053.0043.0048.0055.0308.0053.0053  
# 45: G⁼7 C⁼ 8 B⁼ 9	Ss̈uhcst	has but	\pL in U+0053.0073.0308.0075.0068.0063.0073.0074  
# 46: G⁼7 C⁼ 8 B⁼ 9	Ssühcst	has but	\pL in U+0053.0073.0075.0308.0068.0063.0073.0074  
# 47: G⁼8 C⁼ 8 B⁼10	coŀleció	has but	\pL in U+0063.006f.0140.006c.0065.0063.0069.00f3  
# 48: G⁼8 C⁼ 8 B⁼10	ÓiceĿloC	has but	\pL in U+00d3.0069.0063.0065.013f.006c.006f.0043  
# 49: G⁼8 C⁼ 9 B⁼11	coŀleció	has but	\pL in U+0063.006f.0140.006c.0065.0063.0069.006f.0301  
# 50: G⁼8 C⁼ 9 B⁼11	COĿLECIÓ	has but	\pL in U+0043.004f.013f.004c.0045.0043.0049.004f.0301  
# 51: G⁼9 C⁼10 B⁼12	col·leció	LACKS	\pL in U+0063.006f.006c.00b7.006c.0065.0063.0069.006f.0301  
# 52: G⁼9 C⁼10 B⁼12	COL·LECIÓ	LACKS	\pL in U+0043.004f.004c.00b7.004c.0045.0043.0049.004f.0301  

perl -CS -Mcharnames=:full,:short -le 'print for(
 "M","\N{Greek:Mu}","\x{B5}","\N{Greek:mu}","C",
 "\N{Latin:C WITH CEDILLA}","C\N{COMBINING CEDILLA}",
 "C\N{COMBINING CEDILLA}\N{COMBINING CARON}",
 "\N{Latin:C WITH CEDILLA}\N{COMBINING CARON}",
 "C\N{COMBINING CARON}\N{COMBINING CEDILLA}",
 "\N{SCRIPT SMALL E}\N{COMBINING CEDILLA}",
 "\N{SCRIPT SMALL E}","\N{SCRIPT CAPITAL R}",
 "\N{SCRIPT CAPITAL R}\N{COMBINING ENCLOSING CIRCLE BACKSLASH}",
 "\N{Greek:Pi}","\N{Greek:psi}\N{COMBINING RIGHT ARROW ABOVE}","?",
 "\N{LATIN LETTER GLOTTAL STOP}","\N{LATIN LETTER GLOTTAL STOP}".
 "\N{COMBINING TILDE OVERLAY}","\x{bf}","\N{Greek:pi}\N{SCRIPT SMALL E}",
 "\N{Greek:Phi}\N{SCRIPT SMALL E}\N{COMBINING MACRON}","\x{bf}?",
 "\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
 "\N{LATIN LETTER INVERTED GLOTTAL STOP}",
 "\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
 "\N{LATIN LETTER GLOTTAL STOP}","\x{2161}\x{aa}","II\x{aa}",
 "\N{Greek:Psi}\N{SCRIPT SMALL E}\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}",
 "C\N{Latin:o with acute}mo","Co\N{COMBINING ACUTE ACCENT}mo",
 "\x{bf}Co\N{COMBINING ACUTE ACCENT}mo?",
 "\N{LATIN LETTER INVERTED GLOTTAL STOP}Co\N{COMBINING ACUTE ACCENT}".
 "mo\N{LATIN LETTER GLOTTAL STOP}",
 "\N{LATIN LETTER INVERTED GLOTTAL STOP}\x{332}C\x{332}o\x{332}".
 "\N{COMBINING ACUTE ACCENT}\x{332}m\x{332}o\x{332}".
 "\N{LATIN LETTER GLOTTAL STOP}\x{332}",
 "wr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
 "\Uwr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
 "wr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
 "\Uwr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
 "la\N{Latin:yogh}amon",uc"La\N{Latin:yogh}amon",
 "tsch\N{Latin:u with diaeresis}\x{df}", 
 "tschu\N{COMBINING DIAERESIS}\x{df}",scalar reverse(
 "\utsch\N{Latin:u with diaeresis}\x{df}"), 
 "\Utsch\N{Latin:u with diaeresis}\x{df}",
 "\Utschu\N{COMBINING DIAERESIS}\x{df}",ucfirst scalar reverse(
 "tschu\N{COMBINING DIAERESIS}\x{df}"),ucfirst reverse(scalar reverse reverse
 "tschu\N{COMBINING DIAERESIS}\x{df}"=~/(?#YANETUT)\X/g),
 "co\x{140}leci\N{Latin:o with acute}",ucfirst reverse(
 "\ucol\u\x{140}eci\N{Latin:o with acute}"),
 "co\x{140}lecio\N{COMBINING ACUTE ACCENT}",
 "\Uco\x{140}lecio\N{COMBINING ACUTE ACCENT}",
 "col\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
 "\Ucol\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
)'|perl -CS -mbytes -nle '(($m)=m=\A(\X+)\z=)||die;
   printf"%2d: G\x{207C}%d C\x{207C}%2d B\x{207C}%2d".
   "\t%s\t%s\t\\pL in U+%v04x  \n",++$i,scalar(()= 
   $m=~m~\X~g),(length$m,bytes::length$m,$m),$m=~m
   "\A(?:(?=\pL)\X)+\z"?"has but":"LACKS",$m,
;';
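
For anyone who would rather not read it sideways, here is roughly what the
second stage of the pipeline computes, unrolled into a sub.  This is a
sketch and not the program above: the name gcb() is borrowed from the
header comment, and it measures bytes by encoding to UTF-8 with Encode
instead of calling bytes::length, so that it also works on string literals
that never passed through a -CS stream.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return (graphemes, characters, UTF-8 bytes, verdict) for a string.
sub gcb {
    my ($s) = @_;
    my $graphemes = () = $s =~ /\X/g;           # count of \X matches
    my $chars     = length $s;                  # code points
    my $bytes     = length encode('UTF-8', $s); # octets once encoded
    my $verdict   = $s =~ /\A(?:(?=\pL)\X)+\z/ ? 'has but' : 'LACKS';
    return ($graphemes, $chars, $bytes, $verdict);
}

# Rows 8 and 31 of the expected output, built from escapes.
for my $s ("C\x{327}\x{30C}", "\x{BF}Co\x{301}mo?") {
    my ($g, $c, $n, $v) = gcb($s);
    printf "G=%d C=%2d B=%2d %s \\pL in U+%v04x\n", $g, $c, $n, $v, $s;
}
```

The counts should agree with the corresponding rows of the table in the
program's comments: one grapheme, three characters, five bytes for the
doubly accented C, and six, seven, nine for the bracketed "Como".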
