perl.perl5.porters | Postings from December 2009

Re: RFC: regex /i folding always use utf8?

From:
demerphq
Date:
December 10, 2009 01:41
Subject:
Re: RFC: regex /i folding always use utf8?
Message ID:
9b18b3110912100140p2ea7d499i5be81610cbce80e@mail.gmail.com
2009/12/10 karl williamson <public@khwilliamson.com>:
> I can't remember all the details now; and need to get into it again to
> reconstruct it.  I should have submitted a bug report.  I hope I've learned
> my lesson.
>
> The part I remember is about char classes, and maybe that is the whole
> thing.  I started writing code around it.  One issue is that almost half the
> letters of the ASCII alphabet in 5.1 are whole or parts of folded utf8
> characters.  E.g., f i is the fold for the ligature ﬁ (U+FB01); k is a fold for the
> Kelvin symbol, etc.  When these are in char classes, they can get optimized
> out (I don't remember the details right now, but I have code that does) so
> that they just don't exist when a utf8 string comes along to be matched.

Right, charclasses being broken in some Unicode contexts does not
surprise me: the CC algorithm is not designed to handle
multi-codepoint folding, and it is quite inefficient when operating on
Unicode charclasses.
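
To make the multi-codepoint problem concrete, here is a minimal sketch in
Python, not the Perl regex engine's code. It assumes a hypothetical
per-codepoint membership test (naive_class_match) and uses Python's
str.casefold(), which applies Unicode full case folding. A fold that expands
to two codepoints can never be found by a lookup that considers one codepoint
at a time.

```python
# Illustration only: a hypothetical one-codepoint-at-a-time class test.
# str.casefold() performs Unicode full case folding.

LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI, folds to the two letters "fi"
KELVIN = "\u212a"       # KELVIN SIGN, folds to the single letter "k"


def naive_class_match(ch, members):
    """Fold a single character and look the result up in the class."""
    return ch.casefold() in members


# A case-insensitive class equivalent to [fik], stored as single codepoints.
folded_class = {"f", "i", "k"}

print(repr(LIGATURE_FI.casefold()))                  # 'fi'
print(naive_class_match(KELVIN, folded_class))       # True
print(naive_class_match(LIGATURE_FI, folded_class))  # False
```

The Kelvin sign works because its fold is still one codepoint. The ligature
fails because its fold is the sequence "fi", which a set of single codepoints
cannot represent.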

An AWESOME project for someone with tuits and an interest would be to
implement one of the other data structures for charclasses, like
skiplists, and possibly use the trie for tricky folding scenarios.
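
As a rough illustration of what a range-based charclass could look like, here
is a Python sketch under stated assumptions. It swaps in a sorted array of
merged codepoint ranges with binary search rather than a skip list. Lookup is
logarithmic in the number of ranges either way, and the array is shorter to
show. The RangeClass name and its layout are hypothetical and do not come
from the Perl source.

```python
import bisect


class RangeClass:
    """Character class stored as sorted, merged, inclusive codepoint ranges."""

    def __init__(self, ranges):
        merged = []
        for lo, hi in sorted(ranges):
            # Merge ranges that overlap or sit directly next to each other.
            if merged and lo <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        self.starts = [lo for lo, _ in merged]
        self.ends = [hi for _, hi in merged]

    def __contains__(self, ch):
        cp = ord(ch)
        # Find the last range whose start is <= cp, then check its end.
        i = bisect.bisect_right(self.starts, cp) - 1
        return i >= 0 and cp <= self.ends[i]


# ASCII letters plus the Greek and Coptic block (U+0370..U+03FF).
letters = RangeClass([(ord("a"), ord("z")), (ord("A"), ord("Z")), (0x0370, 0x03FF)])

print("q" in letters)       # True
print("\u03a9" in letters)  # True (GREEK CAPITAL LETTER OMEGA)
print("5" in letters)       # False
```

Storage grows with the number of ranges rather than the number of codepoints,
which is what makes this kind of structure attractive for large Unicode
classes. Multi-codepoint folds would still need separate handling, for
example with the trie.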

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


