Re: RFC: regex /i folding always use utf8?

Gerard Goossen
December 8, 2009 04:41
On Sun, Dec 06, 2009 at 11:05:58PM -0700, karl williamson wrote:
> I have been trying to solve the discrepancies involving the
> semantics being different when a scalar is stored in utf8 or not.
> To review, there are 3 major and 1 very minor known areas where this
> occurs.  Blead already contains a fix for one of the major areas:
> case changing via uc() and its cousins.
> I am about to submit a patch that solves it for another of the major
> areas: regex matching (non-folded).  And I'm close to having a patch
> for the minor area.
> If those patches are accepted, it will leave just one area left, and
> that is qr/.../i.  I think it would be a very good thing if the
> whole problem could be solved for 5.12.
> I want to throw out for comment the possibility that this could be
> solved trivially by always using utf8 for case insensitive matching.
> Already blead does this if the regex has a trie (although from
> comments in the code, the need for this might stem from the
> inconsistent behavior, which I'm fixing, so it's possible that the
> new patch will allow tries to not have to be utf8; I'm not sure.)
> I was working on the case folding issue earlier this year, and found
> problem after problem, bug after bug.  Some of these are fixed by
> going to utf8; some are not.
> It's pretty clear to me that people aren't using Perl for serious
> Unicode work with case folding, or there would be a lot more bug
> reports on it than there are.  Before I got distracted by fixing
> mktables, I was coming to the idea that the current scheme of things
> just might not ever fully work for code points that have multiple
> character folds--that possibility just never was planned for in the
> original algorithm.

Regarding multi-character folds: I think the current method of using
a code point as the unit should be abandoned. A minimum number of code points
(which is also a minimum number of bytes) might be useful for quickly skipping
things, but otherwise code points are useless as a unit. But like you said,
that is a discussion for another time.
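
To make the multi-character fold problem concrete, here is a small sketch.
It is written in Python rather than Perl, purely as an illustration, and it
relies on Python's str.casefold(), which applies Unicode full case folding.
The helper name fold_lengths is made up for this example. It shows that two
strings which match caselessly need not have the same number of code points,
which is why per-code-point comparison breaks down.

```python
# Illustration only: Python's str.casefold() applies Unicode full case
# folding, so one code point can fold to several.
samples = ["ß", "\ufb03", "Straße"]  # "\ufb03" is LATIN SMALL LIGATURE FFI


def fold_lengths(s):
    """Return (code points before folding, code points after folding)."""
    return len(s), len(s.casefold())


for s in samples:
    before, after = fold_lengths(s)
    print(f"{s!r}: {before} -> {after} code point(s), folds to {s.casefold()!r}")

# These two strings match caselessly but differ in code-point length,
# so comparing them one code point at a time cannot work.
assert "STRASSE".casefold() == "Straße".casefold()
assert len("STRASSE") != len("Straße")
```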

> But that is a discussion for another time.  By just changing things
> so that /i implies a utf8 pattern, we trivially solve the remaining
> known inconsistencies between utf8ness or not, at the expense of
> execution slow-down.  That trade-off was already deemed worth taking
> for tries. I'm wondering what people think of doing it for all /i
> regexes?

I think it is a good idea. But I suspect some people will be less pleased
when they realize how inefficient the current utf8 matching is.

Gerard Goossen
