
Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.

From: demerphq
Date: April 24, 2007 03:35
Subject: Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.
Message ID: 9b18b3110704240335l7e90724aj31a291435e8d2cc7@mail.gmail.com

On 4/24/07, Juerd Waalboer <juerd@convolution.nl> wrote:
> demerphq skribis 2007-04-24 11:37 (+0200):
> > One would assume that unicode semantics would be obeyed when either
> > the string or pattern was unicode, and that latin1 semantics (for lack
> > of a better term) would be followed only when neither were unicode.
>
> If I didn't know Perl, I would assume that it would always use Unicode
> semantics, or never, because I read somewhere that Perl only has one
> string type.
>
> > The problem is that the optimiser thinks that /\xDF/i under unicode is
> > really 'ss' and therefore that the minimum length string that can
> > match is 2.
>
> Ouch.
>
> > At this point the only solution I can think of is to disable minlen
> > checks when a character is encountered that folds to a multi-character
> > string.
>
> I think correctness is more important than performance, especially when
> it is needed for real world languages like German.

Turns out this bug affects Greek and German, three codepoints in total:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
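
To make the fold lengths concrete, here is a rough sketch in Python (not Perl, and not the regex engine's actual code). It relies on Python's str.casefold(), which applies Unicode full case folding. The helper names fold_info and naive_minlen are hypothetical and exist only for this illustration. naive_minlen mimics an optimiser that takes the minimum match length from the folded form of a literal.

```python
# The three code points listed above.
CODEPOINTS = [0x00DF, 0x0390, 0x03B0]

def fold_info(cp):
    """Return (character, full case fold, fold length) for a code point."""
    ch = chr(cp)
    folded = ch.casefold()  # Unicode full case folding
    return ch, folded, len(folded)

def naive_minlen(literal):
    """Hypothetical optimiser rule: minimum length = length of the folded literal."""
    return len(literal.casefold())

for cp in CODEPOINTS:
    ch, folded, n = fold_info(cp)
    parts = " ".join(f"{ord(c):04X}" for c in folded)
    print(f"{cp:04X} folds to {parts} ({n} code points)")

# A one-character subject fails a length check derived from the fold,
# even though it is the very character the pattern names.
subject = "\u00df"
print(len(subject) >= naive_minlen("\u00df"))
```

Under this assumed rule the pattern for a single sharp s demands at least two characters, so a subject consisting of that one character is rejected before any matching is attempted.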

The fact that it doesn't affect any of the other 106 special case
foldings in the Unicode 5 spec is IMO a miracle perched on top of a
bug perched on top of a melting ice-cream-cone.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
