On Thu, Feb 08, 2007 at 01:04:14PM +0100, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 12:19 (+0100):
> > It has variable length characters (..) Also UTF8 has the property that
> > there is no valid utf8 sequence that is itself a subsequence of a
> > valid utf8 sequence.
>
> Because of the latter, the former is not a big problem. That is, if your
> application allows you to be naive and just match bytes instead of
> characters.
>
> In my "plan", you'd consider each byte a character.
>
> > Also your plan wouldnt work with case insensitive matching would it?
>
> Correct. But I did specifically mention that it is an incorrect
> solution. Here, correctness is traded in for performance. When trying to
> find all € signs in a 300 MB string, and replacing them with ASCII E's,
> ignoring that you're doing UTF8 helps a lot.
>
> You'd look for \x{20ac}, you'd be looking for \xe2\x82\xac.

Do NOT use \xe2\x82\xac to create bytes. Use pack (or
\x[e2]\x[82]\x[ac]) to create bytes.

But looking for this byte sequence is already what the current regex
engine does:

#!perl
use Benchmark "cmpthese";
use strict;
use warnings;
use utf8;
use Encode;

my $n = 200000;
my $count = 200000000/$n;

# Character string: $n filler records, then a single euro sign at the end.
my $a = ("\x{20ad} abc" x $n) . "\x{20ac}";

# Same internal buffer, but with the UTF8 flag off, so it is treated as bytes.
my $a_bytes = $a;
Encode::_utf8_off($a_bytes);

cmpthese( $count, {
    # Match the character U+20AC in the character string.
    utf8  => sub { $a =~ m/\x{20ac}/ or die; },
    # Match its three UTF-8 bytes in the byte string.
    bytes => sub { use bytes; $a_bytes =~ m/\xe2\x82\xac/ or die; },
} );

bleadperl:
        Rate bytes  utf8
bytes 337/s    --   -0%
utf8  338/s    0%    --

Gerard Goossen.
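
A minimal sketch of the pack suggestion above, assuming a small sample string. The variable names and the sample data are illustrative only and are not part of the benchmark. It builds the three UTF-8 bytes of U+20AC with pack, checks that Encode::encode produces the same bytes, and then does the naive byte-level search and replace discussed in the quoted text:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the UTF-8 encoding of U+20AC explicitly, byte by byte, instead of
# relying on how "\xe2\x82\xac" is interpreted inside a string literal.
my $euro_bytes = pack("C*", 0xe2, 0x82, 0xac);

# Encoding the character itself yields the same three bytes.
my $encoded = encode("UTF-8", "\x{20ac}");

# Hypothetical sample data: three filler records, then one euro sign,
# encoded to a byte string.
my $haystack = encode("UTF-8", ("\x{20ad} abc" x 3) . "\x{20ac}");

# Byte-level search. UTF-8 lead bytes and continuation bytes come from
# disjoint ranges, so a complete encoded character cannot match starting
# in the middle of another character.
my $pos = index($haystack, $euro_bytes);

# Byte-level replacement of every euro sign with an ASCII "E".
(my $replaced = $haystack) =~ s/\Q$euro_bytes\E/E/g;

printf "same bytes: %s\n", ($euro_bytes eq $encoded ? "yes" : "no");
printf "byte offset of first match: %d\n", $pos;
printf "length before/after: %d/%d\n", length($haystack), length($replaced);
```

Note that $pos is a byte offset, not a character offset, and that this approach only covers exact matches; as noted in the quoted text, case-insensitive matching would not work on raw bytes.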