develooper Front page | perl.perl5.porters | Postings from April 2011

Is perldelta missing an incompatibility entry for \cX?

From: Tom Christiansen
Date: April 22, 2011 19:23
Subject: Is perldelta missing an incompatibility entry for \cX?
Message ID: 1770.1303525388@chthon

I have discovered what looks to me like an incompatible change that
perldelta is missing an entry for.  It is listed only as a deprecation,
but in some instances it seems to be an outright incompatibility.  I don't
want to go back to the old way or anything, though.  And there's still a
compiler bug.

Here's the appropriate podtoc excerpt:

 Incompatible Changes
     Regular Expressions and String Escapes
         \400-\777
         Most C<\p{}> properties are now immune to case-insensitive matching
         \p{} implies Unicode semantics
         Regular expressions retain their localeness when interpolated
         Stringification of regexes has changed
         Run-time code blocks in regular expressions inherit pragmata
     Stashes and Package Variables
         Localised tied hashes and arrays are no longed tied
         Stashes are now always defined
         Clearing stashes
         Dereferencing typeglobs
         Magic variables outside the main package
         local($_) strips all magic from $_
         Parsing of package and variable names
     Changes to Syntax or to Perl Operators
         C<given> return values
         Change in parsing of certain prototypes
         Smart-matching against array slices
         Negation treats strings differently from before
         Negative zero
         C<:=> is now a syntax error
         Change in the parsing of identifiers
     Threads and Processes
         Directory handles not copied to threads
         C<close> on shared pipes
         fork() emulation will not wait for signalled children
     Configuration
         Naming fixes in Policy_sh.SH may invalidate Policy.sh
         Perl source code is read in text mode on Windows

See?  No mention of \c.  That happens here:

 Deprecations
     Omitting a space between a regular expression and subsequent word
     C<\cI<X>>
     C<"\b{"> and C<"\B{">
     Deprecation warning added for deprecated-in-core Perl 4-era .pl libraries
     List assignment to C<$[>
     Use of qw(...) as parentheses
     C<\N{BELL}>
     C<?PATTERN?>
     Tie functions on scalars holding typeglobs
     User-defined case-mapping
     Deprecated modules
     *  L<Devel::DProf>

Why isn't it listed as an incompatibility, only as a deprecation?

    =head2 C<\cI<X>>

    The backslash-c construct was designed as a way of specifying
    non-printable characters, but there were no restrictions (on ASCII
    platforms) on what the character following the C<c> could be.  Now,
    a deprecation warning is raised if that character isn't an ASCII character.
    Also, a deprecation warning is raised for C<"\c{"> (which is the same
    as simply saying C<";">).

I can make the compiler blow up now.

I do not know quite what to insert there, but I think something
should be.  It has to do with the new way that \cX is treated,
which now can cause very different behavior, including a couple
of kinds of compiler explosions.

I do understand the new warning when you take a Control-é 
to make a ©, or vice versa:

    % perl5.12.3 -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | perl5.12.3 -C0 
    A9

or

    % perl -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%s is %X\n), \$x, ord \$x"' | perl -CS
    © is A9

but now it complains (but does it anyway, of course):

    % blead -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -C0
    Character following "\c" must be ASCII at - line 1.
    A9

However, I do not understand why the parser is doing this:

    % perl5.12.3 -CS -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | perl5.12.3 -CS -Mutf8
    Malformed UTF-8 character (unexpected continuation byte 0xa9, with no preceding start byte) at - line 1.
    83

    % blead -CS -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -CS -Mutf8
    Character following "\c" must be ASCII at - line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa9, with no preceding start byte) at - line 1.
    83

Well, actually I *do*: it's because U+00E9 is \xC3\xA9 as UTF-8, and if
you xor 0xC3 with an '@', which is 0x40 or 64, then you do get 0x83.
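
That arithmetic can be checked with a minimal sketch, written in Python rather than Perl purely for illustration. The variable names are made up here and nothing below comes from the Perl source. It only mimics what a byte-oriented \c would do to the first byte of UTF-8 encoded U+00E9:

```python
# Illustration only: mimic a byte-oriented "\c" applied to UTF-8 encoded e-acute.
encoded = "\u00e9".encode("utf-8")      # two bytes: 0xC3 0xA9
first, rest = encoded[0], encoded[1:]

controlled = first ^ 0x40               # XOR the lead byte with '@' (0x40)
result = bytes([controlled]) + rest     # 0x83 followed by the orphaned 0xA9

print(encoded.hex(), "->", result.hex())

# Both 0x83 and 0xA9 have the bit pattern 10xxxxxx, i.e. they are
# continuation bytes with no start byte, so strict decoding fails.
try:
    result.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
print("valid UTF-8:", decodes)
```

The point is only that the XOR touches one byte of a two-byte sequence. The leftover 0xA9 is the byte named in the "unexpected continuation byte" message.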

In yylex(), we have this:

            /* \c is a control character */
            case 'c':
                s++;
                if (s < send) {
                    *d++ = grok_bslash_c(*s++, has_utf8, 1);
                }
                else {
                    yyerror("Missing control char name in \\c");
                }
                continue;

And then here:

    STATIC char
    S_grok_bslash_c(pTHX_ const char source, const bool utf8, const bool output_warning)
    {

        U8 result;

        if (utf8) {
            /* Trying to deprecate non-ASCII usages.  This construct has never
             * worked for a utf8 variant.  So, even though are accepting non-ASCII
             * Latin1 in 5.14, no need to make them work under utf8 */
            if (! isASCII(source)) {
                Perl_croak(aTHX_ "Character following \"\\c\" must be ASCII");
            }
        }

        result = toCTRL(source);
        if (! isASCII(source)) {
                Perl_ck_warner_d(aTHX_ packWARN2(WARN_DEPRECATED, WARN_SYNTAX),
                                "Character following \"\\c\" must be ASCII");
        }

But that is not what I see happening.  It isn't croaking.  It continues.
Oh wait!  It didn't know it was utf8?  

    % blead -C0 -E '$s = q("\c) . chr(0xE9). q("); say "\$x = $s; printf qq(%X\n), ord \$x"' | blead -M5.14.0 -C0
    Character following "\c" must be ASCII at - line 1.
    Global symbol "$x" requires explicit package name at - line 1.
    Global symbol "$x" requires explicit package name at - line 2.
    Execution of - aborted due to compilation errors.

Ok, *now* it croaks.  But I don't understand why.  I haven't said use utf8.
Why does it think that we're in utf8 in the previous version?  Because of
unicode_strings?  That's an awfully big difference in parsing, eh!  Without 
unicode_strings, it gets me a copyright symbol, and with it, I get a croak.
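
To make the two outcomes easier to compare, here is a rough Python model of the control flow in the S_grok_bslash_c excerpt above. It is a sketch under assumptions, not the real implementation. The names grok_bslash_c_model and to_ctrl are hypothetical. ValueError and a Python warning stand in for Perl_croak and Perl_ck_warner_d. The toCTRL behaviour (upper-case ASCII letters, then XOR with 64) is my assumption for ASCII platforms. The output_warning parameter is omitted because the quoted lines never consult it.

```python
import warnings

MESSAGE = 'Character following "\\c" must be ASCII'


def to_ctrl(byte: int) -> int:
    # Assumption: toCTRL upper-cases ASCII letters, then XORs with 64.
    if ord("a") <= byte <= ord("z"):
        byte -= 32
    return byte ^ 64


def grok_bslash_c_model(source: int, utf8: bool) -> int:
    """Model of the quoted excerpt: croak under utf8, otherwise warn."""
    is_ascii = source < 0x80
    if utf8 and not is_ascii:
        raise ValueError(MESSAGE)                    # stands in for Perl_croak
    result = to_ctrl(source)
    if not is_ascii:
        warnings.warn(MESSAGE, DeprecationWarning)   # stands in for the warner
    return result


if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = grok_bslash_c_model(0xE9, utf8=False)
    print(hex(value), "warnings:", len(caught))
    try:
        grok_bslash_c_model(0xE9, utf8=True)
    except ValueError as err:
        print("croak:", err)
```

In this model the only thing separating "warn and hand back 0xA9" from "croak" is the utf8 flag passed in. It says nothing about how yylex() decides the value of has_utf8.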

I'm not saying we should go back to the ugly thing that Java does
(see below).  But I don't believe there is really a malformed UTF-8
character anywhere, so I don't think it should say that.  I'm still
trying to decide whether the parser croaking if and only if I say
unicode_strings irrespective of whether I use utf8 source makes sense.

I think perldelta needs to say something about this.

Karl, how did you come across all the weirdness with \cX?  Was it
because the regex compiler has to recognize it on its own, separate from
qq interpolation?

I always thought that Control-X meant ('X'^'@') in C, so (ord("X") ^
ord("@")) in Perl, and that the reason for writing ^C is because of the
xor.  I notice there's no mention of xor'ing with an "@".  I like that
explanation because it explains why a ^@ is NUL, why ^? is DEL, etc.
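
A tiny sketch of that xor-with-"@" reading, in Python for illustration. The helper name ctrl is made up here, and restricting it to ASCII input is my choice, not something any of the implementations discussed does:

```python
def ctrl(ch: str) -> str:
    """Return the control character for one ASCII character: ch XOR '@'."""
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("expected exactly one ASCII character")
    return chr(ord(ch) ^ ord("@"))


if __name__ == "__main__":
    # '@' maps to NUL, 'C' to ETX, '?' to DEL, and '{' to ';'.
    for ch in "@C?{":
        print(repr(ch), "->", repr(ctrl(ch)))
```

The "{" case is the one the perldelta text mentions: 0x7B xor 0x40 is 0x3B, a semicolon.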

Did you know that the Java regex compiler simply xors that way no matter
what the input or output code point should happen to be?  Specifically, 
it does this in java/util/regex/Pattern.java:

    private int c() {
        if (cursor < patternLength) {
            return read() ^ 64;
        }
        throw error("Illegal control escape sequence");
    }

See how indiscriminate that is?

This means that "\cC" is chr(ord("C")^64), or chr(3), but "\cc" is
chr(ord("c")^64), so "#" -- and thus also \c# is "c".  It also means that
"\cé" is "©" and "\c©" is "é", "\cα" (alpha) is "ϱ" (rho symbol) and
"\cϱ" is "α", etc.

But there is one thing you have to give it credit for: it doesn't get
confused about UTF-8 boundaries.  Its internal read() used there is
guaranteed to return a character, not a piece of a character. :(
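
Those pairings are easy to verify with a short Python sketch that applies the same unconditional xor to whole code points. The function name java_style_c is hypothetical. It imitates only the `read() ^ 64` line quoted above, not the rest of Pattern.java:

```python
def java_style_c(ch: str) -> str:
    """XOR a single code point with 64, with no case folding or range check."""
    return chr(ord(ch) ^ 64)


if __name__ == "__main__":
    # C, c, #, e-acute, copyright sign, alpha, rho symbol
    samples = ["C", "c", "#", "\u00e9", "\u00a9", "\u03b1", "\u03f1"]
    for ch in samples:
        out = java_style_c(ch)
        print(f"U+{ord(ch):04X} -> U+{ord(out):04X}")
```

Because the xor flips a single bit of the code point, applying it twice gets you back where you started, which is why every pairing works in both directions.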

And people think I'm too hard on the Java regex compiler! That's
not true: I'm hard on everybody's compilers. :)

--tom
