
Re: RFC: regex /i folding always use utf8?

From: demerphq
Date: December 9, 2009 12:18
Subject: Re: RFC: regex /i folding always use utf8?
Message ID: 9b18b3110912091217kbe3205w4109fd6c15d57535@mail.gmail.com

2009/12/7 karl williamson <public@khwilliamson.com>:
> I have been trying to solve the discrepancies involving the semantics being
> different when a scalar is stored in utf8 or not.
>
> To review, there are 3 major and 1 very minor known areas where this occurs.
>  Blead already contains a fix for one of the major areas: case changing via
> uc() and its cousins.
>
> I am about to submit a patch that solves it for another of the major areas:
> regex matching (non-folded).  And I'm close to having a patch for the minor
> area.
>
> If those patches are accepted, it will leave just one area left, and that is
> qr/.../i.  I think it would be a very good thing if the whole problem could
> be solved for 5.12.
>
> I want to throw out for comment the possibility that this could be solved
> trivially by always using utf8 for case insensitive matching.
>
> Already blead does this if the regex has a trie (although from comments in
> the code, the need for this might stem from the inconsistent behavior, which
> I'm fixing, so it's possible that the new patch will allow tries to not have
> to be utf8; I'm not sure.)

The problem comes from things like \xDF matching ( [sS][sS] | \xDF )
(there is a new one too, but it isn't a problem as it's a "high
codepoint"). This effectively means that two tries have to be
constructed, one for the non-Unicode case and one for the Unicode
case. A similar problem also comes up in character classes, in
particular with logical operations like [^[:alnum:]abc] and things
like that. Basically the idea was broken out of the box, just not
obviously enough to make it clear that you can't hack around it. Even
for the trie, I thought for a long time that supporting both in one
structure was doable; now I do not.
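
To make the \xDF case concrete, here is a minimal sketch (not taken from the Perl core) that matches one pattern against the same two-character string stored two ways. The variable names are illustrative, and the output is deliberately not stated: whether the two results differ depends on the perl version and on which pragmas are in effect.

```perl
use strict;
use warnings;

# Pattern containing only U+00DF (sharp s), matched case-insensitively.
my $pattern = qr/\A\xDF\z/i;

# The same string "SS", once in native storage and once upgraded to UTF-8.
my $native   = "SS";
my $upgraded = "SS";
utf8::upgrade($upgraded);    # same characters, different internal encoding

# The two scalars are equal as strings, so any difference reported below
# comes only from the internal representation.
for my $case ( [ native => $native ], [ upgraded => $upgraded ] ) {
    my ($label, $string) = @$case;
    printf "%-8s %s\n", $label,
        $string =~ $pattern ? "matches" : "does not match";
}
```

If the two lines of output disagree, that is the discrepancy under discussion: the non-Unicode interpretation accepts only \xDF, while the Unicode interpretation also accepts the [sS][sS] alternatives.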

>
> I was working on the case folding issue earlier this year, and found problem
> after problem, bug after bug.  Some of these are fixed by going to utf8;
> some are not.

I can imagine.

> It's pretty clear to me that people aren't using Perl for serious Unicode
> work with case folding, or there would be a lot more bug reports on it than
> there are.  Before I got distracted by fixing mktables, I was coming to the
> idea that the current scheme of things just might not ever fully work for
> code points that have multiple character folds--that possibility just never
> was planned for in the original algorithm.

Maybe I misunderstand you, but...

Both streams are supposed to be case-folded as a normalization step
and then compared. The trie logic handles this properly, as far as I
know. The matching algorithm was determined by the Unicode folks, so I
don't see why it shouldn't work.
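
The fold-then-compare idea can be sketched at the string level as follows. This is only an illustration of the principle, not the regex engine's trie code: it assumes Perl 5.16 or later for the fc() builtin (which postdates this thread), and caseless_eq is a hypothetical helper name.

```perl
use v5.16;       # enables strict, the fc() builtin and unicode_strings
use warnings;

# Hypothetical helper: fold both operands, then compare the folded forms.
sub caseless_eq {
    my ($left, $right) = @_;
    return fc($left) eq fc($right);
}

my @pairs = (
    [ "stra\xDFe", "STRASSE" ],   # U+00DF folds to the two characters "ss"
    [ "Perl",      "PERL"    ],
    [ "perl",      "pearl"   ],   # genuinely different strings
);

for my $i ( 0 .. $#pairs ) {
    my ($left, $right) = @{ $pairs[$i] };
    printf "pair %d: %s\n", $i + 1,
        caseless_eq($left, $right) ? "equal" : "different";
}
```

Because both sides are folded before the comparison, a multi-character fold such as \xDF to "ss" needs no special casing in the comparison itself.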

Or do you mean something else? Our character classes? In which case I
can completely see your point. And Jarkko long ago suggested to me
that we should put effort into rewriting the char-class code.
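
For the character-class side, a small sketch along the same lines, reusing the [^[:alnum:]abc] class mentioned earlier in this message. The choice of \xE9 as the test character is an assumption made for illustration, and the output is again not stated because it depends on the perl version and pragmas in effect.

```perl
use strict;
use warnings;

# Negated class combining a POSIX class with literal characters.
my $class = qr/[^[:alnum:]abc]/;

# U+00E9 (e with acute), once in native storage and once upgraded.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, different internal encoding

# Whether \xE9 counts as alphanumeric decides whether the negated class
# matches it; the two storage forms may be judged differently.
for my $case ( [ native => $native ], [ upgraded => $upgraded ] ) {
    my ($label, $string) = @$case;
    printf "%-8s %s\n", $label,
        $string =~ $class ? "matches" : "does not match";
}
```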

Cheers,
Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


