Is perldelta missing an incompatibility entry for \cX?
From: Tom Christiansen
Date: April 22, 2011 19:23
Subject: Is perldelta missing an incompatibility entry for \cX?
Message ID: 1770.1303525388@chthon
I have discovered what looks to me like an incompatible change that
perldelta is missing an entry for. It lists the change only as a
deprecation, but in some instances it seems to be incompatible. I don't
want to go back to the old way or anything, though. And there's still a
compiler bug.
Here's the appropriate podtoc excerpt:
    Incompatible Changes
        Regular Expressions and String Escapes
            \400-\777
            Most C<\p{}> properties are now immune to case-insensitive matching
            \p{} implies Unicode semantics
            Regular expressions retain their localeness when interpolated
            Stringification of regexes has changed
            Run-time code blocks in regular expressions inherit pragmata
        Stashes and Package Variables
            Localised tied hashes and arrays are no longed tied
            Stashes are now always defined
            Clearing stashes
            Dereferencing typeglobs
            Magic variables outside the main package
            local($_) strips all magic from $_
            Parsing of package and variable names
        Changes to Syntax or to Perl Operators
            C<given> return values
            Change in parsing of certain prototypes
            Smart-matching against array slices
            Negation treats strings differently from before
            Negative zero
            C<:=> is now a syntax error
            Change in the parsing of identifiers
        Threads and Processes
            Directory handles not copied to threads
            C<close> on shared pipes
            fork() emulation will not wait for signalled children
        Configuration
            Naming fixes in Policy_sh.SH may invalidate Policy.sh
            Perl source code is read in text mode on Windows
See? No mention of \c. That happens here:
    Deprecations
        Omitting a space between a regular expression and subsequent word
        C<\cI<X>>
        C<"\b{"> and C<"\B{">
        Deprecation warning added for deprecated-in-core Perl 4-era .pl libraries
        List assignment to C<$[>
        Use of qw(...) as parentheses
        C<\N{BELL}>
        C<?PATTERN?>
        Tie functions on scalars holding typeglobs
        User-defined case-mapping
        Deprecated modules
            * L<Devel::DProf>
Why isn't it listed as an incompatibility, only as a deprecation?
    =head2 C<\cI<X>>

    The backslash-c construct was designed as a way of specifying
    non-printable characters, but there were no restrictions (on ASCII
    platforms) on what the character following the C<c> could be. Now,
    a deprecation warning is raised if that character isn't an ASCII character.
    Also, a deprecation warning is raised for C<"\c{"> (which is the same
    as simply saying C<";">).
I can make the compiler blow up now.
I do not know quite what to insert there, but I think something
should be. It has to do with the new way that \cX is treated,
which now can cause very different behavior, including a couple
of kinds of compiler explosions.
I do understand the new warning when you take a Control-é
to make a ©, or vice versa:
    % perl5.12.3 -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | perl5.12.3 -C0
    A9
or
    % perl -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%s is %X\n), \$x, ord \$x"' | perl -CS
    © is A9
but now it complains (but does it anyway, of course):
    % blead -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -C0
    Character following "\c" must be ASCII at - line 1.
    A9
However, I do not understand why the parser is doing this:
    % perl5.12.3 -CS -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | perl5.12.3 -CS -Mutf8
    Malformed UTF-8 character (unexpected continuation byte 0xa9, with no preceding start byte) at - line 1.
    83

    % blead -CS -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -CS -Mutf8
    Character following "\c" must be ASCII at - line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa9, with no preceding start byte) at - line 1.
    83
Well, actually I *do*: it's because U+00E9 is \xC3\xA9 as UTF-8, and if
you xor 0xC3 with an '@', which is 0x40 or 64, then you do get 0x83.
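The arithmetic behind that 0x83 can be checked with a small C sketch. The function names below are hypothetical and exist only for illustration; the sketch assumes the \c handler sees just the first UTF-8 byte of the character, which is what the output above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+0800 as UTF-8 (one or two bytes).
 * Returns the number of bytes written to out. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    /* Two-byte form: 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* ASSUMPTION: a byte-oriented \c handler xors only the first byte
 * with '@' (0x40) and leaves the continuation byte behind. */
unsigned char ctrl_of_first_byte(uint32_t cp)
{
    unsigned char buf[2];
    utf8_encode_small(cp, buf);
    return (unsigned char)(buf[0] ^ 0x40);
}
```

For U+00E9 the encoder yields the bytes 0xC3 0xA9, and xoring the first byte gives 0x83. The stranded 0xA9 is then the "unexpected continuation byte" the warning complains about.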
In yylex(), we have this:
    /* \c is a control character */
    case 'c':
        s++;
        if (s < send) {
            *d++ = grok_bslash_c(*s++, has_utf8, 1);
        }
        else {
            yyerror("Missing control char name in \\c");
        }
        continue;
And then here:
    STATIC char
    S_grok_bslash_c(pTHX_ const char source, const bool utf8, const bool output_warning)
    {
        U8 result;
        if (utf8) {
            /* Trying to deprecate non-ASCII usages. This construct has never
             * worked for a utf8 variant. So, even though are accepting non-ASCII
             * Latin1 in 5.14, no need to make them work under utf8 */
            if (! isASCII(source)) {
                Perl_croak(aTHX_ "Character following \"\\c\" must be ASCII");
            }
        }
        result = toCTRL(source);
        if (! isASCII(source)) {
            Perl_ck_warner_d(aTHX_ packWARN2(WARN_DEPRECATED, WARN_SYNTAX),
                             "Character following \"\\c\" must be ASCII");
        }
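The two branches in that excerpt can be modelled outside of perl with a standalone sketch. Everything named here (classify_bslash_c, the ctrl_status enum) is hypothetical and is not Perl's API; it also assumes that toCTRL() on ASCII platforms amounts to an ASCII-only uppercase followed by an xor with 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes standing in for "fine", "warn", and "croak". */
enum ctrl_status { CTRL_OK, CTRL_DEPRECATED, CTRL_FATAL };

/* Model of the decision in the excerpt: non-ASCII is fatal when the
 * utf8 flag is set, and only a deprecation warning otherwise. */
enum ctrl_status classify_bslash_c(unsigned char source, bool utf8,
                                   unsigned char *result)
{
    bool is_ascii = source < 0x80;

    assert(result != NULL);
    if (utf8 && !is_ascii)
        return CTRL_FATAL;          /* stands in for Perl_croak() */

    /* ASSUMPTION: toCTRL(c) behaves like (ASCII toupper(c)) ^ 64 */
    unsigned char up = (source >= 'a' && source <= 'z')
                     ? (unsigned char)(source - 32) : source;
    *result = (unsigned char)(up ^ 64);

    return is_ascii ? CTRL_OK : CTRL_DEPRECATED;
}
```

Under this model, the byte 0xE9 with the utf8 flag clear gives 0xA9 plus a warning, and the same byte with the flag set is fatal. Which of the two you get therefore depends entirely on what the caller passes for the flag.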
But that is not what I see happening. It isn't croaking. It continues.
Oh wait! It didn't know it was utf8?
    % blead -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -M5.14.0 -C0
    Character following "\c" must be ASCII at - line 1.
    Global symbol "$x" requires explicit package name at - line 1.
    Global symbol "$x" requires explicit package name at - line 2.
    Execution of - aborted due to compilation errors.
Ok, *now* it croaks. But I don't understand why. I haven't said use utf8.
Why does it think that we're in utf8 in the previous version? Because of
unicode_strings? That's an awfully big difference in parsing, eh! Without
unicode_strings, it gets me a copyright symbol, and with it, I get a croak.
I'm not saying we should go back to the ugly thing that Java does
(see below). But I don't believe there is really a malformed UTF-8
character anywhere, so I don't think it should say that. I'm still
trying to decide whether it makes sense for the parser to croak if and
only if I say unicode_strings, irrespective of whether I use utf8 source.
I think perldelta needs to say something about this.
Karl, how did you come across all the weirdness with \cX? Was it
because the regex compiler has to recognize it on its own, separate from
qq interpolation?
I always thought that Control-X meant ('X'^'@') in C, so (ord("X") ^
ord("@")) in Perl, and that the reason for writing ^C is because of the
xor. I notice there's no mention of xor'ing with an "@". I like that
explanation because it explains why a ^@ is NUL, why ^? is DEL, etc.
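A short C sketch of that xor explanation, restricted to ASCII. The function names are made up for illustration, and the asserts encode the assumption that only 7-bit input is meaningful here.

```c
#include <assert.h>

/* Control character named by ^X: xor the ASCII character with '@' (0x40). */
unsigned char ctrl_from_caret(unsigned char letter)
{
    assert(letter < 0x80);              /* ASCII only */
    return (unsigned char)(letter ^ '@');
}

/* Inverse direction: the printable character used to display a control
 * character in caret notation. The same xor undoes itself. */
unsigned char caret_from_ctrl(unsigned char ctrl)
{
    assert(ctrl < 0x20 || ctrl == 0x7F); /* C0 controls and DEL */
    return (unsigned char)(ctrl ^ '@');
}
```

With these, ctrl_from_caret('@') is 0 (NUL), ctrl_from_caret('?') is 0x7F (DEL), ctrl_from_caret('C') is 3, and ctrl_from_caret('{') is ';', which matches the perldelta remark about "\c{".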
Did you know that the Java regex compiler simply xors that way no matter
what the input or output code point should happen to be? Specifically,
it does this in java/util/regex/Pattern.java:
    private int c() {
        if (cursor < patternLength) {
            return read() ^ 64;
        }
        throw error("Illegal control escape sequence");
    }
See how indiscriminate that is?
This means that "\cC" is ord("C")^64 or chr(3), but "\cc" is ord("c")^64
so "#" -- and thus also \c# is "c". It also means that "\cé" is "©" and
"\c©" is "é", "\cα" (alpha) is "ϱ" (rho symbol) and "\cϱ" is "α", etc.
But there is one thing you have to give it credit for: it doesn't get
confused about UTF-8 boundaries. Its internal read() used there is
guaranteed to return a character, not a piece of a character. :(
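The pairs listed above follow directly from xoring whole code points, and can be reproduced in a few lines of C. This is only a model of the quoted Java method's arithmetic, with a hypothetical function name; it operates on code points, not bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional xor with 64 on a full code point, mirroring the
 * `read() ^ 64` in the quoted Java method. */
uint32_t xor64_codepoint(uint32_t cp)
{
    assert(cp <= 0x10FFFF);     /* valid Unicode range */
    return cp ^ 64u;
}
```

Because xor with a constant is its own inverse, every mapping comes in pairs: 'c' (0x63) and '#' (0x23), U+00E9 and U+00A9, U+03B1 and U+03F1. Applying the function twice always returns the original code point.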
And people think I'm too hard on the Java regex compiler! That's
not true: I'm hard on everybody's compilers. :)
--tom