On 4/24/07, Juerd Waalboer <juerd@convolution.nl> wrote:
> demerphq skribis 2007-04-24 11:37 (+0200):
> > One would assume that unicode semantics would be obeyed when either
> > the string or pattern was unicode, and that latin1 semantics (for lack
> > of a better term) would be followed only when neither were unicode.
>
> If I didn't know Perl, I would assume that it would always use Unicode
> semantics, or never, because I read somewhere that Perl only has one
> string type.
>
> > The problem is that the optimiser thinks that /\xDF/i under unicode is
> > really 'ss' and therefore that the minimum length string that can
> > match is 2.
>
> Ouch.
>
> > At this point the only solution I can think of is to disable minlen
> > checks when a character is encountered that folds to a multi-character
> > string.
>
> I think correctness is more important than performance, especially when
> it is needed for real world languages like German.

Turns out this bug affects Greek and German, three codepoints in total:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

The fact that it doesn't affect any of the other 106 special case
foldings in the unicode 5 spec is IMO a miracle perched on top of a bug
perched on top of a melting ice-cream-cone.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
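
For anyone who wants to see the length mismatch without digging into the regex engine, here is a rough sketch in Python (not Perl, and not how the optimiser is actually written). It assumes Python's str.casefold(), which applies Unicode full case folding, as a stand-in for the fold the optimiser performs; naive_minlen is a hypothetical helper invented for the illustration. It shows that each of the three codepoints above folds to more codepoints than it started with, so a minimum length computed from the folded pattern can exceed the length of an unfolded string that ought to match.

```python
# Illustration only: Python's str.casefold() stands in for Unicode full
# case folding (the "F" entries in CaseFolding.txt). This is NOT Perl's
# regex optimiser.

# The three codepoints named in the message above.
CODEPOINTS = [0x00DF, 0x0390, 0x03B0]


def fold_all(codepoints):
    """Map each codepoint to its full case fold as a string."""
    return {cp: chr(cp).casefold() for cp in codepoints}


def naive_minlen(pattern_text):
    """Hypothetical minimum-length estimate taken from the folded pattern."""
    return len(pattern_text.casefold())


folded = fold_all(CODEPOINTS)
for cp, text in folded.items():
    hexes = " ".join(f"{ord(c):04X}" for c in text)
    print(f"U+{cp:04X} -> {hexes} ({len(text)} codepoints)")

# A one-codepoint subject that should match itself case-insensitively.
subject = "\u00df"
estimate = naive_minlen("\u00df")
if estimate > len(subject):
    print(f"minlen estimate {estimate} would wrongly reject "
          f"a subject of length {len(subject)}")
```

Skipping the length check whenever the pattern contains a character whose fold is longer than one codepoint, as proposed above, avoids the false rejection at the cost of losing that optimisation for such patterns.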