develooper Front page | perl.perl6.language | Postings from May 2010

Re: Perl6 and "accents"

From: Helmut Wollmersdorfer
Date: May 18, 2010 01:28
Subject: Re: Perl6 and "accents"
Message ID: 4BF24FB3.6090008@wollmersdorfer.at

Tom Christiansen wrote:
> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

>   # Perl 6
>   / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
> 				# (i.e. the accented alphabetics)

>     [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
>     with "Texas quotes", and because we want to reserve whitespace as the first
>     character inside the angles for other uses.]

>     Explicit character classes were deliberately made a little less convenient
>     in Perl 6, because they're generally a bad idea in a Unicode world. For
>     example, the [A-Za-z] character class in the above examples won't even
>     match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
>     alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
>     Cherokee, or Klingon.

> First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!  

Of course. If the author intended to match "special" (= non-ASCII) Latin 
letters, it should be something like

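   use strict;
   use warnings;
   use charnames ();

   binmode(STDOUT, ':encoding(UTF-8)');   # %c prints non-ASCII characters

   for my $codepoint ( 1 .. 0xffff ) {
     my $char = chr($codepoint);
     # Keep letters in the Latin script that are not ASCII A-Z/a-z.
     if (
       $char =~ /\p{L}/
       && $char =~ /\p{Latin}/
       && $char !~ /[A-Za-z]/) {
       printf("%c %04X %s\n", $codepoint, $codepoint,
         charnames::viacode($codepoint));
     }
   }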

> Code like /[^\P{Alpha}A-Za-z]/ matches not just things like
[...]
> but also of course:

[...]

>     00C6 LATIN CAPITAL LETTER AE
>     00D0 LATIN CAPITAL LETTER ETH

Good examples.

Neither can be decomposed. Depending on your needs, 'LETTER AE' can be
seen as a ligature. For example, current botanical Latin allows (AFAIK)
'LETTER AE' as well as 'LETTER A' + 'LETTER E'. If someone needs to match
both variants, there is no way around a locale-specific transliteration.

'LATIN CAPITAL LETTER ETH' looks like an accented character (compare
0110 LATIN CAPITAL LETTER D WITH STROKE). Unicode policy does not (did
not) allow (de-)composition of overlays, which applies, for example, to
all characters 'WITH STROKE'. Thus ':ignoremark' and ':samemark' will be
useless if someone needs similarity matching of, e.g.,

   unmark('ø') =~ /o/
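
To make the point concrete, here is a minimal Perl 5 sketch. It assumes
that 'unmark' means "decompose to NFD with the core Unicode::Normalize
module, then delete every \p{M} character"; that definition is only an
assumption for illustration, not an existing Perl 6 built-in.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (an assumption, not a built-in): canonical
# decomposition followed by removal of all combining marks.
sub unmark {
    my ($s) = @_;
    my $decomposed = NFD($s);
    $decomposed =~ s/\p{M}+//g;
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');

# E9 = e WITH ACUTE, F8 = o WITH STROKE, D0 = ETH, C6 = AE
for my $char ("\x{E9}", "\x{F8}", "\x{D0}", "\x{C6}") {
    my $stripped = unmark($char);
    printf("U+%04X %s -> %s (%s)\n",
        ord($char), $char, $stripped,
        $stripped eq $char ? 'unchanged' : 'mark removed');
}
```

Only U+00E9 has a canonical decomposition, so only it loses its mark;
the 'WITH STROKE' letter, ETH and AE come back unchanged, and a match
against /o/ fails for 'ø'.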

[...]

> It's not.  Accent is not a synonym for any of those.  Not all marks are
> accents, and not all accents are marks.

> I believe what is meant by "accent" is NFD($char) =~ /\pM/.  Fine: then
> say "with diacritics", not "with accents". 

Agreed. Everything related to Unicode should use Unicode terms, at least
in the definition. And if a Unicode term is used, it should mean exactly
what the Unicode standard specifies. E.g. it would be a mistake to
define graphemes as '\X' or '(?>\PM\pM*)', as Unicode provides the
properties 'Grapheme_Base' and 'Grapheme_Extend' (unfortunately they are
not supported by Perl 5 or Perl 6).
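
A rough sketch of why a hand-written pattern is only an approximation
of the standard's definition. It assumes Perl 5.12 or later, where \X is
documented to match an extended grapheme cluster (older versions defined
\X differently). The helper names 'count_clusters' and 'count_legacy'
are made up for this example.

```perl
use 5.012;
use warnings;

# Count matches of \X (extended grapheme clusters on Perl 5.12+).
sub count_clusters {
    my ($s) = @_;
    my $n = () = $s =~ /\X/g;
    return $n;
}

# Count matches of the hand-written approximation: one non-mark
# character followed by any number of marks.
sub count_legacy {
    my ($s) = @_;
    my $n = () = $s =~ /(?>\PM\pM*)/g;
    return $n;
}

my @samples = (
    [ 'e + COMBINING ACUTE' => "e\x{301}" ],
    [ 'CR LF'               => "\r\n" ],
    [ 'Hangul jamo L + V'   => "\x{1100}\x{1161}" ],
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf("%-20s \\X=%d  (?>\\PM\\pM*)=%d\n",
        $label, count_clusters($string), count_legacy($string));
}
```

On such a Perl I would expect the two counts to agree for the base
letter plus combining mark, and to differ for CR LF and for the Hangul
jamo pair, because neither sequence contains a \pM character.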

Helmut Wollmersdorfer
