Re: What should \s \w \d match in 5.12?

From: demerphq
Date: October 5, 2009 12:22
Subject: Re: What should \s \w \d match in 5.12?
Message ID: 9b18b3110910051221k340a2f68u5be9e1e05457e7ae@mail.gmail.com

2009/10/5 Tom Christiansen <tchrist@perl.com>:
>>\w should be its historical meaning
>
> Careful: wouldn't historical meaning include
> locales, wherein \w would also include (for example)
> é and ç in French, ñ in Spanish, ß in German,
> and ð and þ in Icelandic?  And didn't we already
> find that locale-shifting char classes made
> life really hard on the regex engine (at least)?

use locale is in some respects broken by qr//, as it doesn't use regex
flags and instead depends on the context the pattern is compiled within.

So for instance, if you use locale and then have a sub return a qr//
compiled regex, and then use that object alone in a match, then anywhere
that you pass it, it will match using the semantics of the locale in
effect when it is matched. If the qr// is inserted in another pattern,
the localeness of the pattern is destroyed.

In short, qr// objects compiled under use locale give different results
depending on how they are used. These regexes are also much slower
than ones not compiled under locale, as they have to do a lot more
run-time comparisons to check if they match.

> I don't know whether this is harder on it than
> it already suffers under the Unicode vs bytes
> shifts in behavior, but both seem problematic
> to an annoying degree.

Locale regexes are irritating because you can't precompute them. They
are defined to change based on your environment, which can change
between compilation and execution of the regex. So you delay a lot of
stuff that could be precomputed until you are inside the regex matching
loop.

> This is why my test program was tricked into
> thinking \s suddenly started matching VT like
> \v does, despite decades of historical precedent.
> I'd forced it into Unicode mode.  :(

And this is why we really, really want \w and \s and \d to match the
traditional thing, even if this means requiring people to add something
to older scripts to support the legacy behaviour. You can't tell what a
pattern does by looking at it; you have to know the internal bit flags
of the string involved.
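
To make the "internal bit flags" point concrete, here is a minimal
sketch (not part of the original exchange). It assumes a perl with no
unicode_strings feature, no use locale, and no /a, /u or /l modifier in
effect, so the match rules depend on the string's internal UTF8 flag.
The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Two strings holding the same single character, U+00E9 (e with acute).
my $native   = "\xE9";       # stored as one byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, but UTF8 flag now on

# The two strings compare equal as far as Perl code can see ...
my $same = $native eq $upgraded ? 1 : 0;

# ... yet the same pattern answers differently for each of them,
# because \w uses byte rules or Unicode rules depending on that flag.
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

printf "equal=%d native=%d upgraded=%d\n",
    $same, $native_is_word, $upgraded_is_word;
```

Under those assumptions this should print "equal=1 native=0
upgraded=1": nothing in the pattern or in the visible contents of the
string tells you which answer you will get.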

Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
