
Re: UCA and NFC/NFD issues in pattern matching

From: Helmut Wollmersdorfer
Date: March 6, 2011 04:23
Subject: Re: UCA and NFC/NFD issues in pattern matching
Message ID: 4D737CB9.4070107@wollmersdorfer.at

Tom Christiansen wrote:
> I have two points.  First, this excerpt from Synopsis 6:

>     The :m (or :ignoremark) modifier scopes exactly like :ignorecase except
>     that it ignores marks (accents and such) instead of case. It is equivalent
>     to taking each grapheme (in both target and pattern), converting both to
>     NFD (maximally decomposed) and then comparing the two base characters
>     (Unicode non-mark characters) while ignoring any trailing mark characters.
>     The mark characters are ignored only for the purpose of determining the
>     truth of the assertion; the actual text matched includes all ignored
>     characters, including any that follow the final base character.

>     The :mm (or :samemark) variant may be used on a substitution to change the
>     substituted string to the same mark/accent pattern as the matched string.
>     Mark info is carried across on a character by character basis. If the right
>     string is longer than the left one, the remaining characters are
>     substituted without any modification. (Note that NFD/NFC distinctions are
>     usually immaterial, since Perl encapsulates that in grapheme mode.) Under
>     :sigspace the preceding rules are applied word by word.  In perl5, one must
>     manually run two matches on all data.

> First: I notice that ignoring marks (and such) and ignoring case are both
> differently strengthed effects of the Unicode Collation Algorithm.  What
> about simply allowing folks to specify which of the four (or more, I guess)
> levels of UCA equivalence/folding they want?

Draft Unicode Technical Report #30
Character Foldings
http://www.unicode.org/reports/tr30/tr30-4.html

This one?

IMHO this should not be specified in the core of Perl6. Even the
existence of :ignoremark and :samemark is not necessary, because they
cannot fulfill the expectations: 'LATIN O WITH STROKE' (and other
characters with e.g. overlays) is not decomposable, and will not match
'LATIN O' under :ignoremark.
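
The limitation is already visible in Perl 5. What follows is only a
sketch, assuming that "ignoring marks" means decomposing to NFD and
dropping the nonspacing marks; strip_marks is a hypothetical helper, not
something taken from the Synopsis.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose to NFD, then drop nonspacing marks (\p{Mn}).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

# U+00E9 (e with acute) decomposes to "e" + U+0301, so the mark can be dropped.
my $e_acute = strip_marks("\x{E9}");

# U+00F8 (o with stroke) has no canonical decomposition, so nothing is removed.
my $o_stroke = strip_marks("\x{F8}");

print $e_acute  eq "e"      ? "e-acute folds to e\n"         : "e-acute is unchanged\n";
print $o_stroke eq "\x{F8}" ? "o-stroke is left untouched\n" : "o-stroke folds\n";
```

The stroke is part of the character itself rather than a combining mark,
so no amount of decomposition turns it into a plain 'o'.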

Even :ignorecase is usable only in the character range of ASCII, but is
needed for backwards compatibility.

E.g. the German 'SHARP S' can be written in uppercase as 'SS' or 'SZ'.
And Swiss orthography doesn't use 'SHARP S' at all; it always uses 'ss'.

If someone wants to match the 'SHARP S' across all orthographic and
typographic variants, there is no other way than to manually write
something like:

   $string =~ m/(ß|ss|sz)/i;
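
As a small usage sketch, the same alternation can be run against a few
spellings. The helper name and the word list are made up for
illustration, and \x{DF} is used for SHARP S so that the source file
stays ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the alternation; \x{DF} is LATIN SMALL LETTER SHARP S.
sub has_sharp_s_variant {
    my ($string) = @_;
    return $string =~ m/(\x{DF}|ss|sz)/i ? 1 : 0;
}

# German, Swiss, and two uppercase spellings of the same word.
my @spellings = ("Stra\x{DF}e", "Strasse", "STRASSE", "STRASZE");
my @hits = grep { has_sharp_s_variant($_) } @spellings;

printf "%d of %d spellings matched\n", scalar @hits, scalar @spellings;
```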

Language and text processing is full of such examples, which cannot be 
solved by Unicode in a general way. Here I agree with Larry that Perl6 
should only support the general part of Unicode.

Language/Locale (including orthography and typography) specific
processing is the task of Unicode localisation, which should IMHO be
implemented by modules. The more I think about it, the less I can
imagine a general solution using tailored Unicode-properties for
localisation.

> Second: I'm not altogether reassured by the parenned bit about NFD/NFC
> being immaterial.  That's because I've been pretty annoying lately in perl5
> with having to manually run *everything* through a double match every time,
> and I can't avoid it by prenormalizing.  I'm just hoping that perl6 will
> handle this better.
> 
> It's usually like this:
> 
>     NFD($data) =~ $pattern
>     NFC($data) =~ $pattern
> 
> Or if you know your data is NFD:
> 
>         $data  =~ $pattern
>     NFC($data) =~ $pattern
> 
> Or if you know your data is NFC:
> 
>     NFD($data) =~ $pattern
>         $data  =~ $pattern
> 
> That's because even if your data in a known state with respect to
> normalization, if your pattern admits both NFD and NFC forms, which it
> would if read in from a file etc, then you have to run them both.

Mixing different levels of normalisation isn't a good idea. Just bring
everything involved (including the patterns) to the same level.
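
For literal text this can be sketched in Perl 5 as below. It is a minimal
example under the assumption that the pattern is plain text which can be
normalized like the data; nfd_contains is a hypothetical helper. It does
not cover patterns that spell their characters symbolically, and a bare
"e" would still match the first half of a decomposed e-acute.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: bring data and literal needle to NFD, then search.
sub nfd_contains {
    my ($data, $needle) = @_;
    my $nfd_data   = NFD($data);
    my $nfd_needle = NFD($needle);
    return $nfd_data =~ /\Q$nfd_needle\E/ ? 1 : 0;
}

# Precomposed data with a decomposed needle, and the other way round.
print nfd_contains("caf\x{E9}",   "e\x{301}") ? "match\n" : "no match\n";
print nfd_contains("cafe\x{301}", "\x{E9}")   ? "match\n" : "no match\n";
```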

In Perl5 a similar problem exists if someone mixes byte-mode and 
character mode. Then AFAIK a regex like

	$byte_string =~ m/\p{Letter}/;

misbehaves, because every byte is treated as a character of its own.
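
What the mismatch looks like in practice can be sketched as follows. In
this sketch, assuming a reasonably recent Perl 5, the match does not
abort; the bytes are simply read as separate code points and the answer
changes. all_letters is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: does the whole string consist of letters?
sub all_letters {
    my ($string) = @_;
    return $string =~ m/\A\p{Letter}+\z/ ? 1 : 0;
}

# The two UTF-8 bytes that encode U+00E9 (e with acute).
my $byte_string = "\xC3\xA9";
my $char_string = decode('UTF-8', $byte_string);

# As bytes, 0xC3 is read as U+00C3 (a letter) and 0xA9 as U+00A9 (a symbol).
print "bytes: ", all_letters($byte_string), "\n";

# As one decoded character, U+00E9 is a letter.
print "chars: ", all_letters($char_string), "\n";
```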

> For example, suppose you read a pattern whose characters are specified
> indirectly/symbolically:
> 
>     $pattern = q<\xE9>;    	# LATIN SMALL LETTER E WITH ACUTE
> 
> or 
> 
>     $pattern = q<e\x{301}>;   	# "e" + COMBINING ACUTE ACCENT
> 
> It would be ok if those were literal characters, because you
> could just NFD the patterns and be done.  But they're not.  So
> in order for
> 
> 
>     $data =~ $pattern
> 
> to work properly with both, you really have to do a guaranteed
> double-convert/match each time.  This is rather unfortunate, to put it
> mildly.  What you really want is a pattern compile flag that imposes
> canonical matching, and does this correctly even when faced with named
> characters, etc.
> 
> My read of S06 suggests that this will not be an issue.  

In Grapheme mode the pattern q<e\x{301}> normalizes to a single Grapheme
character. That's why Graphemes are so convenient. Graphemes are also
compatible with future versions of Unicode, i.e. your code will keep
working if, e.g., a future version of Unicode assigns a single codepoint
for 'LATIN SMALL LETTER A WITH POINT ABOVE AND POINT BELOW' and your code
contains something like 'a'+'COMBINING DOT ABOVE'+'COMBINING DOT BELOW'.
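
Perl 5 has no grapheme mode, but \X (one extended grapheme cluster) gives
a rough idea of the unit meant here. This is only an approximation of the
Perl6 behaviour described above, and grapheme_count is a hypothetical
helper.

```perl
use strict;
use warnings;

# Hypothetical helper: count extended grapheme clusters with \X.
sub grapheme_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $precomposed = "\x{E9}";           # e with acute as one code point
my $decomposed  = "e\x{301}";         # e + COMBINING ACUTE ACCENT
my $stacked     = "a\x{307}\x{323}";  # a + COMBINING DOT ABOVE + COMBINING DOT BELOW

# Code point counts differ, grapheme counts do not.
printf "%d code point(s), %d grapheme(s)\n", length($_), grapheme_count($_)
    for $precomposed, $decomposed, $stacked;
```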

> I do wonder
> what happens when you want to match just the combining part.  Does
> that fail in grapheme mode?  It shouldn't: you *can* have standalones.

In Grapheme mode 'standalones' can only happen at the beginning of a
string or, more precisely, where no base character precedes them.

> But then we're back to partial matches in the middle of things, which
> is something that plagues us with full Unicode case-folding.  This is
> the 
> 
>     "\N{LATIN SMALL LIGATURE FFI}" =~ /(f)(f)/i
> 
> problem, amongst others.  Seems that you are going to get into the
> same dilemma if you allow matching partial graphemes in grapheme mode.

We can dream of :ignoreorthography or :ignoretypography, but they should
not be built into a regex engine.

Helmut Wollmersdorfer

