develooper Front page | perl.perl5.porters | Postings from December 2009

RFC: regex /i folding always use utf8?

From: karl williamson
Date: December 6, 2009 22:06
Subject: RFC: regex /i folding always use utf8?
Message ID: 4B1C9B46.2020303@khwilliamson.com

I have been trying to resolve the cases where a scalar's semantics 
differ depending on whether it is stored in utf8 or not.

To review, there are 3 major and 1 very minor known areas where this 
occurs.  Blead already contains a fix for one of the major areas: case 
changing via uc() and its cousins.

I am about to submit a patch that solves it for another of the major 
areas: regex matching (non-folded).  And I'm close to having a patch for 
the minor area.

If those patches are accepted, just one area will remain: 
case-insensitive matching, qr/.../i.  I think it would be a very good 
thing if the whole problem could be solved for 5.12.
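
For anyone who has not run into the qr/.../i inconsistency, here is a minimal sketch of it.  It assumes a perl whose default regex semantics depend on the internal encoding (no "use feature 'unicode_strings'" and no explicit charset modifier).  The helper name matches_ci is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: does $str match A-grave (U+00C0) case-insensitively?
sub matches_ci {
    my ($str) = @_;
    return $str =~ /\xC0/i ? 1 : 0;
}

my $native   = "\xE0";      # a-grave, stored as a single byte
my $upgraded = "\xE0";
utf8::upgrade($upgraded);   # same character, now stored internally as utf8

# The two scalars compare equal, yet /i may treat them differently.
print "strings equal:    ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native matches:   ", matches_ci($native), "\n";
print "upgraded matches: ", matches_ci($upgraded), "\n";
```

If the last two lines print different values, that difference is the discrepancy under discussion: the result depends on how the scalar happens to be stored, not on what it contains.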

I want to throw out for comment the possibility that this could be 
solved trivially by always using utf8 for case insensitive matching.
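
A rough user-level analogue of the idea, as a sketch only: the proposal itself would be a change inside the regex compiler, not something done from Perl code.  The sketch assumes that a pattern compiled from a utf8-flagged string is itself treated as utf8, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $target = "\xE0";           # a-grave, not stored as utf8

my $pat = "\xC0";              # A-grave
my $native_re = qr/$pat/i;     # compiled from a non-utf8 string

utf8::upgrade($pat);           # same character, now utf8 internally
my $utf8_re = qr/$pat/i;       # compiled from a utf8 string

# Same target, same logical pattern; only the pattern's storage differs.
print "non-utf8 pattern: ", ($target =~ $native_re ? "match" : "no match"), "\n";
print "utf8 pattern:     ", ($target =~ $utf8_re   ? "match" : "no match"), "\n";
```

Making every /i pattern behave like the second case is what would remove the dependence on storage, at the cost of always taking the utf8 code path.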

Already blead does this if the regex has a trie (although from comments 
in the code, the need for this might stem from the inconsistent 
behavior, which I'm fixing, so it's possible that the new patch will 
allow tries to not have to be utf8; I'm not sure.)

I was working on the case folding issue earlier this year, and found 
problem after problem, bug after bug.  Some of these are fixed by going 
to utf8; some are not.

It's pretty clear to me that people aren't using Perl for serious 
Unicode work with case folding, or there would be a lot more bug reports 
on it than there are.  Before I got distracted by fixing mktables, I was 
coming to the idea that the current scheme might never fully work for 
code points whose case fold is more than one character--that 
possibility was simply never planned for in the original algorithm.
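
To make "multiple character folds" concrete, here is a small sketch using LATIN SMALL LETTER SHARP S (U+00DF), whose full case fold is the two characters "ss".  It assumes a perl recent enough to support the /u modifier (which forces Unicode rules regardless of storage); that modifier is an assumption of the example, not something this message relies on.

```perl
use strict;
use warnings;

my $sharp_s = "\xDF";   # one character
my $double  = "ss";     # two characters

# Under Unicode rules a one-character string and a two-character string
# are expected to match each other case-insensitively, in both directions.
my $one_matches_two = ($sharp_s =~ /ss/iu)   ? 1 : 0;
my $two_matches_one = ($double  =~ /\xDF/iu) ? 1 : 0;

print "lengths: ", length($sharp_s), " vs ", length($double), "\n";
print "sharp s matched by /ss/i:  $one_matches_two\n";
print "'ss' matched by /\\xDF/i:   $two_matches_one\n";
```

The length mismatch is the awkward part: a matcher that compares folded text one character at a time has no natural place for a single character that must line up with two.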

But that is a discussion for another time.  By just changing things so 
that /i implies a utf8 pattern, we trivially solve the remaining known 
inconsistencies between utf8 and non-utf8 strings, at the cost of 
slower execution.  That trade-off was already deemed worth taking for 
tries.  I'm wondering what people think of doing it for all /i regexes?
