
Re: What should \s \w \d match in 5.12?

From: karl williamson
Date: October 5, 2009 22:17
Subject: Re: What should \s \w \d match in 5.12?
Message ID: 4ACAD290.4030708@khwilliamson.com

demerphq wrote:
> 2009/10/5 Tom Christiansen <tchrist@perl.com>:
>>> \w should be its historical meaning
>> Careful: wouldn't historical meaning include
>> locales, wherein \w would also include (for example)
>> é and ç in French, ñ in Spanish, ß in German,
>> and ð and þ in Icelandic?  And didn't we already
>> find that locale-shifting char classes made
>> life really hard on the regex engine (at least)?
> 
> use locale is in some respects broken by qr//, as it doesn't use regex
> flags and depends on the context it is compiled within.
> 
> So for instance, if you use locale and then have a sub return a qr//
> compiled regex, and then use that object alone in a match, anywhere
> you pass it it will match using the semantics of the locale in effect
> when it is matched. If the qr// is inserted in another pattern, the
> localeness of the pattern is destroyed.
> 
> In short qr// results compiled under use locale have different results
> depending on how they are used. These regexes are also much slower
> than ones not compiled under locale as they have to do a lot more run
> time comparisons to check if they match.
> 
>> I don't know whether this is harder on it than
>> it already suffers under the Unicode vs bytes
>> shifts in behavior, but both seem problematic
>> to an annoying degree.
> 
> Locale regexes are irritating because you can't precompute them. They
> are defined to change based on your environment, which can change in
> between compilation and execution of the regex. So you delay a lot of
> stuff that could be precomputed to inside of the regex matching loop.
> 
>> This is why my test program was tricked into
>> thinking \s suddenly started matching VT like
>> \v does, despite decades of historical precedent.
>> I'd forced it into Unicode mode.  :(
> 
> And this is why we really really want \w and \s and \d to match the
> traditional thing, even if this means requiring people to add something
> to older scripts to support the legacy behaviour. You can't tell what a
> pattern does by looking at it; you have to know the internal bit flags
> of the string involved.
> 
> Yves

In reading these comments all at once, I'm not sure we are all on the 
same page as to the proposal, and what happens now.  So, let me state 
what I think both are; correct me if I'm wrong:

The way it works now:

With a 'use locale' or on an EBCDIC platform:
they match whatever the C language ctype routines say they match:
isdigit() for \d, isspace() for \s, and isalnum() for \w (\w also
matches underscore, though I didn't see where that gets added in a
quick scan of the code).

Absent a 'use locale' and not on an EBCDIC platform:

if (the string being matched against doesn't have the utf8 flag on
    && the regular expression doesn't contain something that would
       make it look like it should behave in utf8 semantics; any \p{}
       in it, for example, will force it into utf8)
{
	\d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
} else {
	they match what Unicode says, except that there are some bugs so
	that \w matches too much, like fractions.
}
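
To make the first branch concrete, here is a minimal sketch of how the same
character can match or fail \w depending only on the utf8 flag.  It assumes
no 'use locale', no other pragma that changes regex semantics, and a
non-EBCDIC platform; the helper name matches_w is just made up for
illustration.

```perl
use strict;
use warnings;

# The same character (U+00E9, e-acute) stored two ways.
my $native   = "\xE9";      # single byte, utf8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, utf8 flag now on

# Hypothetical helper: 1 if the one-character string matches \w, else 0.
sub matches_w {
    my ($str) = @_;
    return $str =~ /\A\w\z/ ? 1 : 0;
}

# The two strings compare equal, yet \w treats them differently.
printf "strings are eq: %s\n", ($native eq $upgraded ? "yes" : "no");
printf "native   =~ \\w: %d\n", matches_w($native);
printf "upgraded =~ \\w: %d\n", matches_w($upgraded);
```

Nothing in the pattern itself tells you which answer you will get; only the
internal state of the string does.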





The proposal, as I meant to state it:
No change to 'use locale' or EBCDIC.  Even if we could deprecate 'use 
locale', we would be stuck with supporting it in 5.12, I think.

Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n],
regardless of the utf8 flag or of what the pattern contains.
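
As a rough illustration of what the proposed meanings would amount to, one
can spell the classes out explicitly and compare them with the shorthand on
utf8-flagged strings.  This is only a sketch under the same assumptions (no
'use locale', not EBCDIC); the variable names and the helper hit() are
invented for the example.

```perl
use strict;
use warnings;

# Explicit classes equivalent to the proposed meanings.
my $digit = qr/[0-9]/;
my $word  = qr/[_a-zA-Z0-9]/;
my $space = qr/[ \t\f\r\n]/;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, utf8 flag on
my $e_acute      = "\xE9";
utf8::upgrade($e_acute);         # e-acute with the utf8 flag on
my $vt           = "\x0B";       # vertical tab

# Hypothetical helper: 1 if the whole string matches the pattern, else 0.
sub hit {
    my ($str, $re) = @_;
    return $str =~ /\A$re\z/ ? 1 : 0;
}

# Shorthand follows Unicode on utf8 strings; the explicit classes do not.
printf "U+0663: \\d=%d  [0-9]=%d\n",
    hit($arabic_three, qr/\d/), hit($arabic_three, $digit);
printf "U+00E9: \\w=%d  [_a-zA-Z0-9]=%d\n",
    hit($e_acute, qr/\w/), hit($e_acute, $word);
printf "VT:     [ \\t\\f\\r\\n]=%d\n", hit($vt, $space);
```

The explicit classes give the same answer whatever the flag on the string
is, which is the property the proposal is after.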
