perl.perl5.porters | Postings from April 2007

Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.

From: Juerd Waalboer
Date: April 24, 2007 03:03
Subject: Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.
Message ID: 20070424095954.GD20929@c4.convolution.nl

demerphq wrote 2007-04-24 11:37 (+0200):
> One would assume that Unicode semantics would be obeyed when either
> the string or the pattern was Unicode, and that Latin-1 semantics (for
> lack of a better term) would be followed only when neither was.

If I didn't know Perl, I would assume that it would always use Unicode
semantics, or never, because I read somewhere that Perl only has one
string type.

> The problem is that the optimiser thinks that /\xDF/i under Unicode is
> really 'ss' and therefore that the minimum length of a string that can
> match is 2.

Ouch.
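
To make the quoted problem concrete, here is a minimal sketch in Python rather than Perl. It assumes that Python's str.casefold(), which applies Unicode full case folding, is a fair stand-in for the fold the optimiser performs; it does not reproduce the Perl regex engine itself, and the variable names are made up for illustration.

```python
# Illustration only: str.casefold() applies Unicode full case folding and
# stands in for the fold described in the quoted message.
pattern_char = "\xdf"              # LATIN SMALL LETTER SHARP S

folded = pattern_char.casefold()   # the two-character string "ss"
naive_minlen = len(folded)         # 2, the minimum length the optimiser assumes

subject = "\xdf"                   # a one-character subject string

# A length check based on the folded pattern rejects the subject...
rejected_by_length = len(subject) < naive_minlen
# ...even though subject and pattern are equal once both are folded.
equal_after_folding = subject.casefold() == folded

print(rejected_by_length, equal_after_folding)   # prints: True True
```

The one-character subject is thrown out by the length check before any matching is attempted, although it is the very character the pattern names.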

> At this point the only solution I can think of is to disable minlen
> checks when a character is encountered that folds to a multi-character
> string.

I think correctness is more important than performance, especially when
it is needed for real-world languages like German.
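
The workaround quoted above, disabling the minlen check when a character folds to more than one character, could be sketched roughly as follows. This is Python, not the engine's C code; safe_minlen and passes_minlen are hypothetical helper names, and the sketch handles only a plain literal pattern and only the pattern-side case described in this message.

```python
from typing import Optional


def safe_minlen(literal: str) -> Optional[int]:
    """Return a minimum subject length for a case-insensitive literal,
    or None when the length check should be skipped entirely."""
    for ch in literal:
        # A character whose full case fold is longer than one character
        # (such as "\xdf" folding to "ss") makes a length bound unsafe.
        if len(ch.casefold()) > 1:
            return None
    return len(literal)


def passes_minlen(subject: str, literal: str) -> bool:
    """Hypothetical pre-match check: True means matching should proceed."""
    minlen = safe_minlen(literal)
    return minlen is None or len(subject) >= minlen


print(safe_minlen("abc"))              # 3
print(safe_minlen("stra\xdfe"))        # None
print(passes_minlen("\xdf", "\xdf"))   # True
print(passes_minlen("ab", "abc"))      # False
```

The cost is the one the message accepts: patterns containing such characters lose the early rejection, while all other patterns keep it.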
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>
