On Sun, Dec 06, 2009 at 11:05:58PM -0700, karl williamson wrote:
> I have been trying to solve the discrepancies involving the
> semantics being different when a scalar is stored in utf8 or not.
>
> To review, there are 3 major and 1 very minor known areas where this
> occurs. Blead already contains a fix for one of the major areas:
> case changing via uc() and its cousins.
>
> I am about to submit a patch that solves it for another of the major
> areas: regex matching (non-folded). And I'm close to having a patch
> for the minor area.
>
> If those patches are accepted, it will leave just one area left, and
> that is qr/.../i. I think it would be a very good thing if the
> whole problem could be solved for 5.12.
>
> I want to throw out for comment the possibility that this could be
> solved trivially by always using utf8 for case insensitive matching.
>
> Already blead does this if the regex has a trie (although from
> comments in the code, the need for this might stem from the
> inconsistent behavior, which I'm fixing, so it's possible that the
> new patch will allow tries to not have to be utf8; I'm not sure.)
>
> I was working on the case folding issue earlier this year, and found
> problem after problem, bug after bug. Some of these are fixed by
> going to utf8; some are not.
>
> It's pretty clear to me that people aren't using Perl for serious
> Unicode work with case folding, or there would be a lot more bug
> reports on it than there are. Before I got distracted by fixing
> mktables, I was coming to the idea that the current scheme of things
> just might not ever fully work for code points that have multiple
> character folds--that possibility just never was planned for in the
> original algorithm.

Regarding folding to multiple characters: I think the current method of
using a code point as the unit should be abandoned. A minimum number of
code points (which is also a minimum number of bytes) might be useful
for quickly skipping things, but otherwise they are useless.
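To make the multi-character fold problem concrete, here is a minimal sketch. It is written in Python rather than Perl, purely as an illustration, and says nothing about how Perl's regex engine is implemented. The helper names `naive_equal` and `folded_equal` are made up for this example. It shows a comparison that treats one code point as the unit failing on a pair of strings that full case folding considers equal:

```python
def naive_equal(a: str, b: str) -> bool:
    """Compare one code point at a time, using simple lowercasing.

    This assumes a fold never changes the number of code points.
    """
    if len(a) != len(b):
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))


def folded_equal(a: str, b: str) -> bool:
    """Compare after full Unicode case folding of the whole string."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # U+00DF (sharp s) folds to the two code points "ss".
    print(naive_equal("STRASSE", "straße"))   # False: 7 vs 6 code points
    print(folded_equal("STRASSE", "straße"))  # True: both fold to "strasse"

    # U+FB03 (ffi ligature) folds to three code points.
    print(len("\ufb03"), len("\ufb03".casefold()))  # 1 3
```

A length check like the one at the top of `naive_equal` is only safe as a quick rejection test if it uses a lower bound that accounts for such folds, which is roughly the "minimum number of code points" idea mentioned above.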
But, like you said, that is a discussion for another time.

> But that is a discussion for another time. By just changing things
> so that /i implies a utf8 pattern, we trivially solve the remaining
> known inconsistencies between utf8ness or not, at the expense of
> execution slow-down. That trade-off was already deemed worth taking
> for tries. I'm wondering what people think of doing it for all /i
> regexes?

I think it is a good idea. But I suspect some people will be less
pleased when they realize how inefficient the current utf8 matching is.

Gerard Goossen