perl.perl5.porters | Postings from December 2009

Re: RFC: regex /i folding always use utf8?

Rafael Garcia-Suarez
December 7, 2009 02:31
2009/12/7 karl williamson:
> I have been trying to solve the discrepancies where the semantics differ
> depending on whether a scalar is stored in utf8 or not.
> To review, there are 3 major and 1 very minor known areas where this occurs.
>  Blead already contains a fix for one of the major areas: case changing via
> uc() and its cousins.
> I am about to submit a patch that solves it for another of the major areas:
> regex matching (non-folded).  And I'm close to having a patch for the minor
> area.
> If those patches are accepted, it will leave just one area, and that is
> qr/.../i.  I think it would be a very good thing if the whole problem could
> be solved for 5.12.

I concur; that would improve release consistency.
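
For reference, a minimal sketch of the kind of discrepancy under discussion. The helper name and the choice of U+00E9 / U+00C9 are illustrative assumptions, not taken from the patches. On a perl affected by the inconsistency, the two scalars should compare equal and yet may disagree on the /i match:

```perl
use strict;
use warnings;

# Hypothetical helper for this illustration: does the string match /\xC9/i
# (LATIN CAPITAL LETTER E WITH ACUTE, matched case-insensitively)?
sub matches_e_acute_ci {
    my ($string) = @_;
    return $string =~ /\xC9/i ? 1 : 0;
}

my $native   = "\xE9";      # e-acute, stored internally as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, now stored as utf8 internally

printf "eq:       %s\n", $native eq $upgraded ? "same" : "different";
printf "native:   %s\n", matches_e_acute_ci($native)   ? "match" : "no match";
printf "upgraded: %s\n", matches_e_acute_ci($upgraded) ? "match" : "no match";
```

The only difference between the two scalars is the internal storage, which is exactly what should not be visible from Perl space.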

> I want to throw out for comment the possibility that this could be solved
> trivially by always using utf8 for case insensitive matching.

Unless "use legacy" is activated?

> Already blead does this if the regex has a trie (although from comments in
> the code, the need for this might stem from the inconsistent behavior, which
> I'm fixing, so it's possible that the new patch will allow tries to not have
> to be utf8; I'm not sure.)
> I was working on the case folding issue earlier this year, and found problem
> after problem, bug after bug.  Some of these are fixed by going to utf8;
> some are not.
> It's pretty clear to me that people aren't using Perl for serious Unicode
> work with case folding, or there would be a lot more bug reports on it than

Yes, that's also my impression. I think that Perl's behaviour is currently
a bit too difficult to understand, and people seem to prefer implementing
half-cargo-culted workarounds to reporting bugs.

> there are.  Before I got distracted by fixing mktables, I was coming to the
> idea that the current scheme of things just might not ever fully work for
> code points that have multiple character folds--that possibility just never
> was planned for in the original algorithm.
> But that is a discussion for another time.  By just changing things so that
> /i implies a utf8 pattern, we trivially solve the remaining known
> inconsistencies between utf8ness or not, at the expense of execution
> slow-down.  That trade-off was already deemed worth taking for tries. I'm
> wondering what people think of doing it for all /i regexes?

I would vote for it.
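
As a rough user-level approximation of what "always use utf8 for /i" would mean, here is a sketch that forces the target into the utf8 representation before matching. The function match_ci_as_utf8 is hypothetical and is not how the regex engine would implement the proposal; it only shows the intended effect, and the extra copy and upgrade hint at where the slow-down would come from:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: match case-insensitively after
# forcing the target string into the utf8 internal representation.
sub match_ci_as_utf8 {
    my ($string, $pattern) = @_;
    my $copy = $string;       # work on a copy; the caller's scalar is untouched
    utf8::upgrade($copy);     # changes the storage, not the characters
    return $copy =~ /$pattern/i ? 1 : 0;
}

my $e_acute = "\xE9";         # stored internally as a single byte
print match_ci_as_utf8($e_acute, "\xC9") ? "match\n" : "no match\n";
```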
