Re: Perl6 and "accents"
From: Helmut Wollmersdorfer
Date: May 18, 2010 01:28
Subject: Re: Perl6 and "accents"
Message ID: 4BF24FB3.6090008@wollmersdorfer.at
Tom Christiansen wrote:
> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:
> # Perl 6
> / < <alpha> - [A-Za-z] >+ / # All alphabetics except A-Z or a-z
> # (i.e. the accented alphabetics)
> [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
> with "Texas quotes", and because we want to reserve whitespace as the first
> character inside the angles for other uses.]
> Explicit character classes were deliberately made a little less convenient
> in Perl 6, because they're generally a bad idea in a Unicode world. For
> example, the [A-Za-z] character class in the above examples won't even
> match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
> alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
> Cherokee, or Klingon.
> First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!
Of course. If the author intended to match "special" (= non-ASCII) Latin
letters, it should be something like
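    use charnames ();
    binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings

    for my $codepoint ( 1 .. 0xffff ) {
        my $char = chr($codepoint);
        # a letter, in the Latin script, but not plain ASCII A-Z/a-z
        if (
            $char =~ /\p{L}/
            && $char =~ /\p{Latin}/
            && $char !~ /[A-Za-z]/) {
            printf("%c %04X %s\n", $codepoint, $codepoint,
                charnames::viacode($codepoint));
        }
    }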
> Code like /[^\P{Alpha}A-Za-z]/ matches not just things like
[...]
> but also of course:
[...]
> 00C6 LATIN CAPITAL LETTER AE
> 00D0 LATIN CAPITAL LETTER ETH
Good examples.
Neither can be decomposed. Depending on your needs, 'LETTER AE' can be
seen as a ligature. For example, current botanical Latin allows (AFAIK)
'LETTER AE' but also 'LETTER A' + 'LETTER E'. If someone needs to match
both variants, there is no way around a locale-specific transliteration.
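A minimal Perl 5 sketch of that point, assuming the core Unicode::Normalize module. NFD leaves U+00C6 untouched, so matching both spellings needs an explicit mapping. The %translit table and the fold_ligatures() helper are hypothetical names for illustration, not part of any module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical, application-specific table; extend it as your locale requires.
my %translit = ( "\x{C6}" => 'AE', "\x{E6}" => 'ae' );

# Hypothetical helper: replace the AE ligatures with their two-letter spelling.
sub fold_ligatures {
    my ($text) = @_;
    $text =~ s/([\x{C6}\x{E6}])/$translit{$1}/g;
    return $text;
}

# NFD does not split the ligature: it is still a single code point.
print length( NFD("\x{C6}") ), "\n";    # prints 1

# After the explicit mapping, both spellings compare equal.
print fold_ligatures("C\x{E6}sar") eq 'Caesar' ? "same\n" : "different\n";
```

Which pairs belong in such a table depends on the language and the application, so it cannot be derived from normalization alone.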
'LATIN CAPITAL LETTER ETH' looks like an accented character (compare 0110
LATIN CAPITAL LETTER D WITH STROKE). Unicode policy does not (did not) allow
(de-)composition of overlays, which applies, for example, to all
characters 'WITH STROKE'. Thus ':ignoremark' and ':samemark' will be
useless if someone needs similarity matching of e.g.
unmark('ø') =~ /o/
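The same limitation can be seen in Perl 5 today. This is a rough sketch, assuming the core Unicode::Normalize module; strip_marks() is a hypothetical stand-in for the unmark() above, implemented as "decompose, then drop every \p{M}".

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonical decomposition, then remove combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{M}//g;
    return $decomposed;
}

# e-acute, o-with-stroke, ETH, D-with-stroke
for my $cp ( 0xE9, 0xF8, 0xD0, 0x110 ) {
    my $stripped = strip_marks( chr($cp) );
    my @result   = map { sprintf( 'U+%04X', ord($_) ) } split( //, $stripped );
    printf "U+%04X -> %s\n", $cp, join( ' ', @result );
}
```

Only U+00E9 comes back as a plain base letter (U+0065). The three characters with an overlay have no decomposition, so they pass through unchanged and would still fail to match /o/ or /d/.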
[...]
> It's not. Accent is not a synonym for any of those. Not all marks are
> accents, and not all accents are marks.
> I believe what is meant by "accent" is NFD($char) =~ /\pM/. Fine: then
> say "with diacritics", not "with accents".
Agreed. Everything related to Unicode should use Unicode terms, at least
in the definition. And if a Unicode term is used, it should mean exactly
what is specified in the Unicode standard. E.g. it would be a mistake to
define graphemes by '\pX' or '(?>\PM\pM*)', as Unicode provides the
properties 'Grapheme_Base' and 'Grapheme_Extend' (unfortunately they are
not supported by Perl 5 or Perl 6).
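To illustrate why the choice of definition matters, here is a small sketch that counts "graphemes" in two ways. It assumes Perl 5.12 or later, where \X matches an extended grapheme cluster; count_matches() is a hypothetical helper and the sample strings are my own choice.

```perl
use strict;
use warnings;

# Hypothetical helper: how many times does the pattern match in the string?
sub count_matches {
    my ( $text, $re ) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

my %samples = (
    'e + combining acute' => "e\x{301}",
    'CR LF'               => "\r\n",
    'Hangul L + V jamo'   => "\x{1100}\x{1161}",
);

for my $name ( sort keys %samples ) {
    printf "%-20s \\X=%d  legacy=%d\n", $name,
        count_matches( $samples{$name}, qr/\X/ ),
        count_matches( $samples{$name}, qr/(?>\PM\pM*)/ );
}
```

Under that assumption both patterns see one unit in the first sample, but the base-plus-marks pattern should report two units for CR LF and for the Hangul jamo pair, where \X reports one. A definition has to say which behaviour is meant.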
Helmut Wollmersdorfer