2009/12/7 karl williamson <public@khwilliamson.com>:
> I have been trying to solve the discrepancies involving the semantics being
> different when a scalar is stored in utf8 or not.
>
> To review, there are 3 major and 1 very minor known areas where this occurs.
> Blead already contains a fix for one of the major areas: case changing via
> uc() and its cousins.
>
> I am about to submit a patch that solves it for another of the major areas:
> regex matching (non-folded). And I'm close to having a patch for the minor
> area.
>
> If those patches are accepted, it will leave just one area left, and that is
> qr/.../i. I think it would be a very good thing if the whole problem could
> be solved for 5.12.

I concur; that would improve release consistency.

> I want to throw out for comment the possibility that this could be solved
> trivially by always using utf8 for case insensitive matching.

Unless "use legacy" is activated?

> Already blead does this if the regex has a trie (although from comments in
> the code, the need for this might stem from the inconsistent behavior, which
> I'm fixing, so it's possible that the new patch will allow tries to not have
> to be utf8; I'm not sure.)
>
> I was working on the case folding issue earlier this year, and found problem
> after problem, bug after bug. Some of these are fixed by going to utf8;
> some are not.
>
> It's pretty clear to me that people aren't using Perl for serious Unicode
> work with case folding, or there would be a lot more bug reports on it than

Yes, that's also my impression. I think that Perl's behaviour is currently
a bit too difficult to understand, and people seem to prefer implementing
half-cargo-culted workarounds to reporting bugs.

> there are. Before I got distracted by fixing mktables, I was coming to the
> idea that the current scheme of things just might not ever fully work for
> code points that have multiple character folds--that possibility just never
> was planned for in the original algorithm.
>
> But that is a discussion for another time. By just changing things so that
> /i implies a utf8 pattern, we trivially solve the remaining known
> inconsistencies between utf8ness or not, at the expense of execution
> slow-down. That trade-off was already deemed worth taking for tries. I'm
> wondering what people think of doing it for all /i regexes?

I would vote for it.
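
For anyone following along, here is a minimal sketch of the kind of /i discrepancy being discussed. It assumes a perl where neither the unicode_strings feature nor a locale is in effect. The helper name matches_ci and the choice of U+00E0 and U+00C0 are illustrative only; they do not come from the thread or from any of the patches mentioned.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E0 (a with grave).
my $native   = "\xE0";       # stored internally as a single native byte
my $upgraded = "\xE0";
utf8::upgrade($upgraded);    # same character, now stored internally as utf8

# Hypothetical helper: case-insensitive match against U+00C0 (A with grave).
sub matches_ci {
    my ($str) = @_;
    return $str =~ /\xC0/i ? 1 : 0;
}

# The two scalars compare equal; only their internal storage differs.
printf "strings equal:    %s\n", ($native eq $upgraded ? "yes" : "no");

# The match result is nonetheless expected to depend on that storage.
printf "native matches:   %s\n", (matches_ci($native)   ? "yes" : "no");
printf "upgraded matches: %s\n", (matches_ci($upgraded) ? "yes" : "no");
```

Under those assumptions the two scalars compare equal, yet only the upgraded one is expected to match. Having /i always imply utf8, as proposed above, would presumably make both cases behave like the upgraded one.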