develooper Front page | perl.perl5.porters | Postings from May 2010

PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w

karl williamson
May 11, 2010 11:54
Oops.  All that work, and then I forgot to attach the patches.  Now 
doing so.

Applying this requires a 'make regen'.  Also, it patches a .t in
Test::Simple; I'm cc'ing the CPAN maintainer.

The attached series of commits fixes the inconsistent handling of Latin1
characters in matching \s, \w, and hence \b (boundary matching), and
their complements \S, \W, and \B.  This solves the second of the 5 areas
of the "Unicode Bug".  (The first, lc(), ucfirst(), ..., was fixed for
5.12.  Those remaining are matching POSIX character classes, matching
under /i, and user-defined case mappings.)
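
A minimal sketch of the inconsistency described above, assuming a perl
where no Unicode or locale matching mode is in effect (no 'use locale',
no 'unicode_strings' feature, no modifier on the pattern).  The helper
name matches() and the variable names are made up for illustration.  It
shows two strings that compare equal but match \w and \s differently,
depending only on how they are stored internally:

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use locale",
# so the patterns below are compiled under the traditional rules.

# Hypothetical helper: returns 1 if $str matches $re, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

my $e_native   = "\xE9";    # e with acute, not UTF-8 encoded internally
my $e_upgraded = "\xE9";
utf8::upgrade($e_upgraded); # same character, UTF-8 encoded internally

my $nbsp_native   = "\xA0"; # no-break space
my $nbsp_upgraded = "\xA0";
utf8::upgrade($nbsp_upgraded);

# The strings compare equal, yet the match results can differ.
printf "strings equal: %d\n", $e_native eq $e_upgraded ? 1 : 0;
printf "\\w native=%d upgraded=%d\n",
    matches($e_native, qr/\w/), matches($e_upgraded, qr/\w/);
printf "\\s native=%d upgraded=%d\n",
    matches($nbsp_native, qr/\s/), matches($nbsp_upgraded, qr/\s/);
```

Under the traditional rules the native strings do not match and the
upgraded ones do, which is the behavior these commits address.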

These commits also add the regex modifiers /u (unicode), /l (locale),
and /t (traditional).  /a is not part of this patch.  I have made up the
term "matching mode" for what these modifiers select.  I'm open to a
better term, if you can think of one.
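
A short sketch of how the /u modifier might be used, assuming a perl
that accepts it (released perls from 5.14 on do).  It does not use /t,
since that spelling is specific to this patch, nor /l, whose results
depend on the locale in effect.  Variable names are illustrative only:

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # stored natively, not UTF-8 encoded

# /u requests Unicode semantics regardless of the internal encoding.
my $with_u = $e_acute =~ /\A\w\z/u ? 1 : 0;

# The same modifier can be applied inline to part of a pattern.
my $inline_u = $e_acute =~ /\A(?u:\w)\z/ ? 1 : 0;

# With no modifier, and no pragma selecting a mode, the traditional
# rules apply to this natively stored string.
my $plain = $e_acute =~ /\A\w\z/ ? 1 : 0;

print "with /u: $with_u, inline (?u:...): $inline_u, plain: $plain\n";
```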

Much of this patch was submitted and withdrawn last year.  This version
has a somewhat cleaner implementation, in that no new regnodes were
added.  Instead, it turns out that the flags field in the affected
regnodes was unused.  By using that field, we fly under the radar of
some other code, which as a result didn't have to change.

Note that there is a behavior change that may be incompatible with
existing code.  Previously, if a regex was compiled within the scope of
'use locale' and then interpolated into another regex outside that
scope, the localeness of the interpolated part was lost, and vice versa.
This patch causes the regex to remember how it was compiled, so its
matching mode stays with it even when interpolated.
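
The effect on interpolation can be sketched as follows.  This assumes a
perl with the /u modifier, and uses /u rather than 'use locale' so that
the result does not depend on the locale of the machine running it.  The
names $inner, $outer, and $control are illustrative:

```perl
use strict;
use warnings;

my $e_acute = "\xE9";             # stored natively, not UTF-8 encoded

# Compiled with Unicode semantics.
my $inner = qr/\w/u;

# Compiled with no modifier; $inner is interpolated into it.
my $outer = qr/\A${inner}\z/;

# Control: the same pattern written out with no mode anywhere.
my $control = qr/\A\w\z/;

my $outer_matches   = $e_acute =~ $outer   ? 1 : 0;
my $control_matches = $e_acute =~ $control ? 1 : 0;

# The interpolated part keeps the mode it was compiled with, so the
# outer pattern matches even though the control pattern does not.
print "outer: $outer_matches, control: $control_matches\n";
```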

Also, the stringification of a regex will now show its matching mode
modifier, e.g., 't', so code that looks at the stringification will have
to change.  Several of the .t changes are because of this, and because
the minimum length of the stringification changed.  For example, a regex
stringifies as (?t-xism:...) with this patch, instead of (?-xism:...)
before it.
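
One way code could avoid depending on the exact stringification is
sketched below.  It uses regexp_pattern() from the 're' module, which is
not part of this patch and is offered here only as an assumption about
how such code might be made more robust.  The exact stringified prefix
varies between perl versions, so the sketch prints it but does not rely
on it:

```perl
use strict;
use warnings;
use re qw(regexp_pattern);

my $re = qr/fo+/i;

# In list context, regexp_pattern() returns the pattern source and the
# modifiers as separate strings, without the (?...:...) wrapper.
my ($pattern, $modifiers) = regexp_pattern($re);

# Fragile: depends on the precise prefix of the stringification.
my $looks_old_style = "$re" =~ /\A\(\?-xism:/ ? 1 : 0;

# Sturdier: ask about the modifier directly.
my $is_case_insensitive = index($modifiers, 'i') >= 0 ? 1 : 0;

print "stringified: $re\n";
print "pattern:     $pattern\n";
print "modifiers:   $modifiers\n";
print "old-style prefix seen: $looks_old_style\n";
print "case-insensitive:      $is_case_insensitive\n";
```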

I'm working on the pod changes, and will submit them later.
