demerphq wrote:
> 2009/10/5 Tom Christiansen <tchrist@perl.com>:
>>> \w should be its historical meaning
>> Careful: wouldn't historical meaning include
>> locales, wherein \w would also include (for example)
>> é and ç in French, ñ in Spanish, ß in German,
>> and ð and þ in Icelandic? And didn't we already
>> find that locale-shifting char classes made
>> life really hard on the regex engine (at least)?
>
> use locale is in some respects broken by qr//, as it doesn't use regex
> flags and depends on the context it is compiled within.
>
> So for instance, if you 'use locale' and then have a sub return a
> qr// compiled regex, and then use that object alone in a match anywhere
> that you pass it, it will match using the semantics of the locale in
> effect when it is matched. If the qr// is inserted in another pattern,
> the localeness of the pattern is destroyed.
>
> In short qr// results compiled under use locale have different results
> depending on how they are used. These regexes are also much slower
> than ones not compiled under locale as they have to do a lot more run
> time comparisons to check if they match.
>
>> I don't know whether this is harder on it than
>> it already suffers under the Unicode vs bytes
>> shifts in behavior, but both seem problematic
>> to an annoying degree.
>
> Locale regexes are irritating because you can't precompute them. They
> are defined to change based on your environment which can change in
> between compilation and execution of the regex. So you delay a lot of
> stuff that could be precomputed to inside of the regex matching loop.
>
>> This is why my test program was tricked into
>> thinking \s suddenly started matching VT like
>> \v does, despite decades of historical precedent.
>> I'd forced it into Unicode mode. :(
>
> And this is why we really really want \w and \s and \d to match the
> traditional thing, even if this means requiring people add something
> to older scripts to support the legacy behaviour. You can't tell what a
> pattern does by looking at it, you have to know the internal bit flags
> of the string involved.
>
> Yves
>
>
>
In reading these comments all at once, I'm not sure we are all on the
same page as to the proposal, and what happens now. So, let me state
what I think both are; correct me if I'm wrong:

The way it works now:

  With a 'use locale' or on an EBCDIC platform:
    \d, \s, and \w match whatever the C language ctype routines say
    they match: isdigit() for \d, isspace() for \s, and isalnum() for
    \w (\w also adds underscore, though I didn't see where it does
    that in a quick scan of the code).

  Absent a 'use locale' and not on an EBCDIC platform:
    if (the string being matched against doesn't have the utf8 flag on
        && the regular expression doesn't contain something that would
           make it look like it should behave with utf8 semantics; any
           \p{} in it, for example, will force it into utf8)
    {
        \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
    } else {
        they match what Unicode says, except that there are some bugs,
        so that \w matches too much, such as fractions.
    }
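
To make the utf8-flag dependence concrete, here is a minimal sketch of
my own (not taken from the regex engine's tests). It assumes an ASCII
platform, no 'use locale', and no pragma or pattern modifier that
changes the default matching rules. The variable names are just for
illustration. I would expect the flag-off string not to match \w and
the flag-on string to match, even though the two strings compare equal.

```perl
use strict;
use warnings;

# The same character, e-acute (U+00E9), stored two ways.
my $native   = "\xE9";      # one byte, utf8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, utf8 flag on

# The two strings compare equal as text...
my $same_text = $native eq $upgraded ? 1 : 0;

# ...but the identical pattern is asked of each representation.
my $native_matches   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /\w/ ? 1 : 0;

printf "equal as strings:          %s\n", $same_text        ? "yes" : "no";
printf "utf8 flag off, \\w matches: %s\n", $native_matches   ? "yes" : "no";
printf "utf8 flag on,  \\w matches: %s\n", $upgraded_matches ? "yes" : "no";
```

Nothing in the pattern itself tells you which answer you will get; you
have to know how the string happens to be stored.
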

What I meant the proposal to be:

  No change to 'use locale' or EBCDIC. Even if we could deprecate 'use
  locale', we would be stuck with supporting it in 5.12, I think.

  Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
  regardless of the utf8 flag or the pattern's contents.
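
As a rough illustration of what those meanings would give you, the
sketch below spells the three classes out as explicit bracketed
classes (with \d written as 0-9 inside the \w class). This is only my
approximation of the proposed semantics using today's syntax, not a
patch; $digit, $word, and $space are names I made up for the example.
The point is that the answer no longer depends on the utf8 flag.

```perl
use strict;
use warnings;

# The proposed meanings, spelled out as explicit classes.
my $digit = qr/[0-9]/;
my $word  = qr/[_a-zA-Z0-9]/;
my $space = qr/[ \t\f\r\n]/;

my $native   = "\xE9";           # e-acute, utf8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);        # e-acute, utf8 flag on

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a Unicode digit

my %result = (
    native_word   => ($native       =~ $word           ? 1 : 0),
    upgraded_word => ($upgraded     =~ $word           ? 1 : 0),
    arabic_digit  => ($arabic_three =~ $digit          ? 1 : 0),
    ascii_word    => ("x_9"         =~ /\A${word}+\z/  ? 1 : 0),
    vertical_tab  => ("\x0B"        =~ $space          ? 1 : 0),
);

# Print each check with 1 for a match and 0 for no match.
printf "%-14s %d\n", $_, $result{$_} for sort keys %result;
```

Both representations of e-acute are treated the same way here, and only
the ASCII string matches as a word.
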