Perl6 and "accents"
From: Tom Christiansen
Date: May 17, 2010 10:52
Subject: Perl6 and "accents"
Message ID: 29651.1274118754@chthon
Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

    # Perl 6
    / < <alpha> - [A-Za-z] >+ /    # All alphabetics except A-Z or a-z
                                   # (i.e. the accented alphabetics)

    [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
    with "Texas quotes", and because we want to reserve whitespace as the first
    character inside the angles for other uses.]

    Explicit character classes were deliberately made a little less convenient
    in Perl 6, because they're generally a bad idea in a Unicode world. For
    example, the [A-Za-z] character class in the above examples won't even
    match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
    alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
    Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!
Code like /[^\P{Alpha}A-Za-z]/ matches not just things like
    00C1  LATIN CAPITAL LETTER A WITH ACUTE
    00C7  LATIN CAPITAL LETTER C WITH CEDILLA
    00C8  LATIN CAPITAL LETTER E WITH GRAVE
    00E5  LATIN SMALL LETTER A WITH RING ABOVE
    00F1  LATIN SMALL LETTER N WITH TILDE
but also of course:
    00AA  FEMININE ORDINAL INDICATOR
    00B5  MICRO SIGN
    00BA  MASCULINE ORDINAL INDICATOR
    00C6  LATIN CAPITAL LETTER AE
    00D0  LATIN CAPITAL LETTER ETH
    00DE  LATIN CAPITAL LETTER THORN
    00DF  LATIN SMALL LETTER SHARP S
    00E6  LATIN SMALL LETTER AE
    00F0  LATIN SMALL LETTER ETH
    01A6  LATIN LETTER YR
    01BA  LATIN SMALL LETTER EZH WITH TAIL
    01BC  LATIN CAPITAL LETTER TONE FIVE
    01BF  LATIN LETTER WYNN
    02C7  CARON
    0391  GREEK CAPITAL LETTER ALPHA
    0410  CYRILLIC CAPITAL LETTER A
and many, many more.
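
A minimal sketch for spot-checking a few of the code points listed above against that pattern. The helper name matches_non_ascii_alpha is made up for illustration and is not part of any module; 0041 is included only as a control that should not match.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: true if the character is Alphabetic but is not
# one of the ASCII letters A-Z or a-z.
sub matches_non_ascii_alpha {
    my ($cp) = @_;
    return chr($cp) =~ /[^\P{Alpha}A-Za-z]/ ? 1 : 0;
}

# A handful of code points from the lists above, plus plain "A" as a control.
for my $cp (0x0041, 0x00C1, 0x00B5, 0x00C6, 0x02C7, 0x0391, 0x0410) {
    printf "%04X %-36s %s\n", $cp, charnames::viacode($cp),
        matches_non_ascii_alpha($cp) ? "matches" : "does not match";
}
```

Everything except the control should match, including MICRO SIGN and CARON, neither of which anyone would call an accented letter.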
I'm also disappointed to see perl6 spreading the notion that "accent"
is somehow a valid synonym for
    diacritical marking
    diacritic marking
    diacritic mark
    diacritic
    mark
It's not. Accent is not a synonym for any of those. Not all marks are
accents, and not all accents are marks.
I believe what is meant by "accent" is NFD($char) =~ /\pM/. Fine: then
say "with diacritics", not "with accents".
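
That test can be sketched as a small function. The name has_mark is hypothetical; NFD is the function exported by Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if the canonical decomposition of the string
# contains at least one character whose general category is M (Mark).
sub has_mark {
    my ($str) = @_;
    return NFD($str) =~ /\pM/ ? 1 : 0;
}

# U+00E9 decomposes to "e" + U+0301, so it carries a mark.
# U+00C6 (AE) and U+00F8 (o with stroke) have no canonical decomposition.
for my $cp (0x0065, 0x00E9, 0x00C6, 0x00F8) {
    printf "%04X %s\n", $cp, has_mark(chr $cp) ? "mark" : "no mark";
}
```

By this test 'ø' carries no mark at all, because its stroke is part of the letter and not a combining character, which is one more reason "accented" is the wrong word for the set being described.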
Also, there are many combining characters that aren't "accents" by any
stretch of the term, such as 20E3 COMBINING ENCLOSING KEYCAP, to name just one.
Only three code points in the Latin-1 range have official names that include
"ACCENT" (005E CIRCUMFLEX ACCENT, 0060 GRAVE ACCENT, and 00B4 ACUTE ACCENT),
and even these are dubious.
Finally, I note also that people use the Alpha property too loosely. Note
the caron and such above. One probably wants the LC property instead.
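
A small sketch of the difference between the two properties, using code points from the lists above. The helper name alpha_and_lc is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point by the two properties.
# Returns a two-element list: (is Alphabetic, is a cased letter).
sub alpha_and_lc {
    my ($cp) = @_;
    my $ch = chr($cp);
    return ( ($ch =~ /\p{Alpha}/ ? 1 : 0), ($ch =~ /\p{LC}/ ? 1 : 0) );
}

for my $cp (0x00C1, 0x0391, 0x02C7) {
    my ($alpha, $lc) = alpha_and_lc($cp);
    printf "%04X Alpha=%d LC=%d\n", $cp, $alpha, $lc;
}
```

02C7 CARON is a modifier letter (general category Lm), so it is Alphabetic but not a cased letter; the two capital letters satisfy both properties.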
--tom
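use strict;
use charnames ();
use Unicode::Normalize;

binmode(STDOUT, ":encoding(UTF-8)");    # so %c can print wide characters

# List cased letters in the BMP whose canonical decomposition
# does not begin with an ASCII letter.
for my $cp ( 1 .. 0xffff ) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates are not characters
    my $orig  = chr($cp);
    my $canon = NFD($orig);    # NFKD gives different results
    ## if ($orig =~ /[^\P{Alpha}A-Za-z]/) {
    if ($orig =~ /\p{LC}/ && $canon !~ /^[A-Za-z]/) {
        printf("%c %04X %s\n", $cp, $cp, charnames::viacode($cp));
    }
}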