
The special charclasses \s \w and \d.

From:
demerphq
Date:
October 17, 2009 13:35
Subject:
The special charclasses \s \w and \d.
Message ID:
9b18b3110910171335j40b66a0cq5b9670e4c98a7e1a@mail.gmail.com
I have basically fixed the failing new tests.

I have not uploaded the fix because these fixes break other tests, and
so far I haven't been able to review them all for correctness and fix
them or their underlying breakage.

Also, I am starting to doubt the tenability of this path. Making \d
mean [0-9] seems clear to me (please speak up if anyone disagrees).
Making \w have the strict [A-Za-z0-9_] behavior by default is looking
less sensible than it seemed at first. (Mea culpa.) Unfortunately,
having it always mean its Unicode interpretation seems prima facie
untenable as well, and leaving it as is leads to IMO irresolvable
logical contradictions in the regex engine. \s has a similar problem.
So this means we have to do this the hard way.
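
To make the contradiction concrete, here is a minimal sketch of what I
mean. It assumes a perl running with the default (legacy) semantics,
i.e. no locale and nothing that switches the defaults. The variable
names are only for illustration. The same character gives a different
answer for \w depending on how the string happens to be stored, and \d
matches more than [0-9] once the string is utf8.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) held as a single byte, utf8 flag off.
my $native = "\xE9";

# The same character, forced into perl's internal utf8 representation.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# The two strings compare equal ...
my $same = $native eq $upgraded ? 1 : 0;

# ... yet \w gives a different answer for each of them.
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit to \d, but not in [0-9].
my $arabic_three  = "\x{0663}";
my $matches_d     = $arabic_three =~ /\d/    ? 1 : 0;
my $matches_range = $arabic_three =~ /[0-9]/ ? 1 : 0;

print "strings equal:      $same\n";
print "native   =~ \\w:     $native_is_word\n";
print "upgraded =~ \\w:     $upgraded_is_word\n";
print "U+0663   =~ \\d:     $matches_d\n";
print "U+0663   =~ [0-9]:  $matches_range\n";
```

Two strings that are eq but disagree about \w is the kind of thing I
am calling irresolvable if we leave the semantics as they are.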

That is, we are going to have to introduce new regex modifiers to
control the syntax and continue to at least support the inconsistent
semantics, if not also continue to default to the broken semantics
:-(. We currently have one bit to control whether regexes are compiled
under locale. This effectively means that when the bit is off we get
the current "broken" semantics. If we add another bit we get a
four-way switch that can be controlled by modifiers, and we can set up
a pragma to control the default semantics. Possibly we make it so that
use perl 5.12 changes the default.

00 - legacy utf8/perl semantics (possibly inconsistent): use re 'legacy';
     no locale; possible modifier /B (for b0rked)
01 - regex compiled under use locale: use locale; use re 'locale';
     possible modifier /L
10 - unicode semantics: use re 'unicode'; possible modifier /U
11 - ascii/perl semantics: use re 'ascii'; possible modifier /A
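
As a rough sketch of how the two bits would decode, something like the
following. This is pure-Perl illustration only: the constant names and
charset_mode() are made up for this mail, and the real thing would
live in the C flag fields, whose layout is not decided.

```perl
use strict;
use warnings;

# Hypothetical bit names, not actual core flags.
use constant {
    RE_LOCALE_BIT  => 0b01,    # the existing "compiled under locale" bit
    RE_CHARSET_BIT => 0b10,    # the proposed second bit
};

# Index is the two-bit value from the table above.
my @MODE = (
    'legacy',     # 00
    'locale',     # 01
    'unicode',    # 10
    'ascii',      # 11
);

# Mask off everything except the two charset bits and look up the mode.
sub charset_mode {
    my ($flags) = @_;
    return $MODE[ $flags & (RE_LOCALE_BIT | RE_CHARSET_BIT) ];
}

printf "%02b -> %s\n", $_, charset_mode($_) for 0 .. 3;
```

The point of keeping 00 as legacy is that existing code, which never
sets either bit, keeps the behavior it has today.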

This has a lot of knock-on consequences, though. Charclass structs
have to be made larger, in the worst case by 4 bytes. It also means
some reorganization of the pmop->flags field and the re->extflags
field, new opcodes, and more code in the regex engine.

The bad news is that doing all the above is a reasonable amount of
work and probably not going to happen in time for the next release.
The good news is obviously that the backwards compatibility problems
would be much lower.

What I will do, however, is push my changes as a branch, so people can
see what perl looks like with just this bug fixed, the restrictive
"ascii-perl" semantics imposed on \w and \s, and a few of the tests
changed in trivial ways to pass with the new semantics.

cheers,
Yves





-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
