perl.perl5.porters | Postings from May 2010

PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w

From: karl williamson
Date: May 11, 2010 11:52
Subject: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Message ID: 4BE9A72A.1020807@khwilliamson.com

make regen is required.  Also, this patches a .t in Test::Simple; I'm 
cc'ing the cpan maintainer.

The attached series of commits fixes the inconsistent handling of Latin1 
characters in matching \s, \w, and hence \b (boundary matching), and 
their complements.  This solves the second of the 5 areas of the 
"Unicode Bug".  (The first, lc(), ucfirst(), ..., was fixed for 5.12. 
Those remaining are matching POSIX character classes, matching /i, and 
user-defined case mappings.)
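
The inconsistency can be seen with a single Latin1 character.  The 
following is only a minimal sketch: it assumes that neither 'use locale' 
nor 'use feature "unicode_strings"' is in effect and that no charset 
modifier is given, so the traditional rules apply.  The variable names 
are made up for illustration.

```perl
use strict;
use warnings;

# U+00E9 (e with acute), held as a single native byte.
my $native = "\xE9";

# The same character, but stored internally as UTF-8.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# The two strings compare equal ...
my $same = ($native eq $upgraded) ? 1 : 0;

# ... yet \w gives different answers under the traditional rules,
# depending only on the internal representation.
my $native_is_word   = ($native   =~ /\w/) ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

print "equal=$same native=$native_is_word upgraded=$upgraded_is_word\n";
```

Under those assumptions this should report that the strings are equal, 
that the native one does not match \w, and that the upgraded one does.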

These commits also add regex modifiers /u (unicode), /l (locale), and /t 
(traditional).  /a is not part of this patch.  I have made up the term 
"Matching mode" to describe what these modifiers select.  I'm open to a 
better term, if you can think of one.
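
As a rough illustration of what the modifiers select, the sketch below 
uses the modifiers as they exist in perl 5.14 and later, not the exact 
ones in this patch.  The assumption is that the released /d modifier 
plays the role that /t (traditional) plays here; /t itself is not 
accepted by released perls.  The variable names are hypothetical.

```perl
use strict;
use warnings;
# Needs perl 5.14 or later, where /u and /d are accepted.

my $s = "caf\xE9s";   # native 8-bit string: c a f e-acute s

# /u: Unicode rules, so e-acute is a word character.
my $u_word = ("\xE9" =~ /\w/u) ? 1 : 0;

# /d: the older rules; for a non-UTF-8 string only ASCII matches \w.
my $d_word = ("\xE9" =~ /\w/d) ? 1 : 0;

# \b follows \w: under /d there is a boundary between e-acute and "s",
# under /u there is none.
my $u_boundary = ($s =~ /\bs/u) ? 1 : 0;
my $d_boundary = ($s =~ /\bs/d) ? 1 : 0;

print "u_word=$u_word d_word=$d_word ",
      "u_boundary=$u_boundary d_boundary=$d_boundary\n";
```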

Much of this patch was submitted and withdrawn last year.  It has a 
somewhat cleaner implementation than that one, in that no new regnodes 
were added.  Instead, it turns out that the flags field in the affected 
regnodes was unused.  By using that, we fly under the radar of some 
other code, which as a result didn't have to change.

Note that there is a behavior change that may be incompatible with 
existing code.  Previously, if a regex was compiled from within 'use 
locale', and then interpolated into another regex outside it, the 
localeness of the interpolated part was lost.  And vice versa.  This 
patch causes the regex to remember how it was compiled, so its matching 
mode stays with it even when interpolated.
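
A sketch of the "remembers how it was compiled" behavior, as it can be 
observed in perl 5.14 and later.  It uses /u instead of 'use locale', 
because locale results depend on which locales the system provides; the 
assumption is that the same principle carries over.  The variable names 
are made up for illustration.

```perl
use strict;
use warnings;
# Needs perl 5.14 or later for /u.

my $inner_u     = qr/\w/u;   # compiled with Unicode rules
my $inner_plain = qr/\w/;    # compiled with the default rules

# Both outer patterns are themselves compiled with the default rules.
my $outer_u     = qr/\A$inner_u\z/;
my $outer_plain = qr/\A$inner_plain\z/;

my $native = "\xE9";         # e-acute as a native 8-bit string

# The interpolated /u piece keeps its Unicode rules inside the outer regex.
my $kept    = ($native =~ $outer_u)     ? 1 : 0;
my $default = ($native =~ $outer_plain) ? 1 : 0;

print "kept=$kept default=$default\n";
```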

Also, the stringification of a regex will now show its matching mode 
modifier, e.g., 't', so code that looks at the stringified form will 
have to change.  Several of the .t changes are because of this, and 
because the minimum length of the stringification changed.  For example, 
a regex stringifies as (?t-xism:...) with this patch, instead of 
(?-xism:...) before it.
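
Code that needs a regex's modifiers may be more robust if it avoids 
parsing the stringified form.  The sketch below assumes perl 5.14 or 
later, where the released stringification uses a (?^...) form rather 
than the (?t-xism:...) form of this patch, and uses 
re::regexp_pattern() to get the pattern and modifiers separately.  The 
variable names are hypothetical.

```perl
use strict;
use warnings;
use re qw(regexp_pattern);
# Needs perl 5.14 or later for /u.

my $plain   = qr/abc/;
my $unicode = qr/abc/u;

# Stringified forms: the charset modifier shows up in the prefix.
my $plain_str   = "$plain";
my $unicode_str = "$unicode";

# In list context, regexp_pattern() returns the pattern and its modifiers.
my ($pat, $mods) = regexp_pattern($unicode);

print "$plain_str $unicode_str pattern=$pat modifiers=$mods\n";
```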

I'm working on the pod changes, and will submit them later.
