perl.perl5.porters | Postings from December 2009

Re: RFC: regex /i folding always use utf8?

From: Rafael Garcia-Suarez
Date: December 7, 2009 02:31
Subject: Re: RFC: regex /i folding always use utf8?
Message ID: b77c1dce0912070231y34038c05teaca0ffff60c74d7@mail.gmail.com

2009/12/7 karl williamson <public@khwilliamson.com>:
> I have been trying to solve the discrepancies involving the semantics being
> different when a scalar is stored in utf8 or not.
>
> To review, there are 3 major and 1 very minor known areas where this occurs.
>  Blead already contains a fix for one of the major areas: case changing via
> uc() and its cousins.
>
> I am about to submit a patch that solves it for another of the major areas:
> regex matching (non-folded).  And I'm close to having a patch for the minor
> area.
>
> If those patches are accepted, that will leave just one area, and that is
> qr/.../i.  I think it would be a very good thing if the whole problem could
> be solved for 5.12.

I concur; that would improve release consistency.
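
For readers who have not run into the discrepancy being discussed, here is a
minimal sketch of how one might observe it. It takes one character, stores it
both as a plain byte string and as a utf8-upgraded copy, and reports how uc()
and a \w match treat each. The variable names are illustrative only, and what
gets printed depends on the Perl version and on which pragmas are in effect, so
no particular output is claimed.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) as a one-byte string, not flagged as utf8.
my $native = "\xe9";

# The same character, stored internally as utf8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The two scalars compare equal as strings...
print "equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";

# ...yet some operations may treat them differently, depending on the
# Perl version and the pragmas in effect.
for my $s ($native, $upgraded) {
    printf "utf8=%d  uc changes it: %s  matches \\w: %s\n",
        (utf8::is_utf8($s) ? 1 : 0),
        (uc($s) ne $s ? "yes" : "no"),
        ($s =~ /\w/   ? "yes" : "no");
}
```

If the two report lines differ in anything other than the utf8 flag, the
semantics depend on the internal storage, which is the problem under discussion.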

> I want to throw out for comment the possibility that this could be solved
> trivially by always using utf8 for case insensitive matching.

Unless "use legacy" is in effect?
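
As a rough user-level approximation of the quoted proposal (not the regex
engine change itself), one could upgrade both the target and the pattern text
to utf8 before a case-insensitive match. The helper name match_i_utf8 is
hypothetical, and it only handles literal pattern text.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive comparison of literal text, done
# after upgrading copies of both the target and the pattern text to utf8.
sub match_i_utf8 {
    my ($target, $pattern_text) = @_;   # copies; the caller's scalars are untouched
    utf8::upgrade($target);
    utf8::upgrade($pattern_text);
    return $target =~ /\Q$pattern_text\E/i ? 1 : 0;
}

my $lower = "\xe9";   # U+00E9, lowercase e with acute
my $upper = "\xc9";   # U+00C9, uppercase E with acute

# Whether the plain match succeeds depends on the Perl version and pragmas.
printf "plain /i:    %s\n", ($lower =~ /\Q$upper\E/i ? "match" : "no match");

# With both sides stored as utf8, the match follows Unicode case folding.
printf "upgraded /i: %s\n", (match_i_utf8($lower, $upper) ? "match" : "no match");
```

The upgrade copies each string and may slow matching, which is the trade-off
raised in the proposal.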

> Already blead does this if the regex has a trie (although from comments in
> the code, the need for this might stem from the inconsistent behavior, which
> I'm fixing, so it's possible that the new patch will allow tries to not have
> to be utf8; I'm not sure.)
>
> I was working on the case folding issue earlier this year, and found problem
> after problem, bug after bug.  Some of these are fixed by going to utf8;
> some are not.
>
> It's pretty clear to me that people aren't using Perl for serious Unicode
> work with case folding, or there would be a lot more bug reports on it than

Yes, that's also my impression. I think that Perl's behaviour is currently
a bit too difficult to understand, and people seem to prefer implementing
half-cargo-culted workarounds to reporting bugs.

> there are.  Before I got distracted by fixing mktables, I was coming to the
> idea that the current scheme of things just might not ever fully work for
> code points that have multiple character folds--that possibility just never
> was planned for in the original algorithm.
>
> But that is a discussion for another time.  By just changing things so that
> /i implies a utf8 pattern, we trivially solve the remaining known
> inconsistencies between utf8ness or not, at the expense of execution
> slow-down.  That trade-off was already deemed worth taking for tries. I'm
> wondering what people think of doing it for all /i regexes?

I would vote for it.
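
To make the "multiple character folds" mentioned in the quoted text concrete,
here is a small sketch using U+FB00 (LATIN SMALL LIGATURE FF), whose full case
fold is the two-character string "ff". It assumes a Perl recent enough to
provide fc() (5.16 or later), which postdates this thread. The result of the
regex line is left unstated, since multi-character folding under /i has varied
between Perl versions.

```perl
use strict;
use warnings;
use feature 'fc';   # fc() needs Perl 5.16 or later

# One code point whose full case fold is two characters long.
my $ligature = "\x{FB00}";   # LATIN SMALL LIGATURE FF

my $folded = fc($ligature);
printf "fold length:   %d\n", length $folded;
printf "folds to 'ff': %s\n", ($folded eq "ff" ? "yes" : "no");

# A single character in the target has to line up with two characters in
# the pattern here, which a one-to-one folding scheme cannot express.
printf "matches /ff/i: %s\n", ($ligature =~ /ff/i ? "yes" : "no");
```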
