
[perl #112084] Lots and lots of string/regex-escape bugs, including charnames

From: tchrist1
Date: March 28, 2012 13:55
Subject: [perl #112084] Lots and lots of string/regex-escape bugs, including charnames
Message ID: rt-3.6.HEAD-4610-1332968110-1932.112084-75-0@perl.org
# New Ticket Created by  tchrist1 
# Please include the string:  [perl #112084]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=112084 >


SUMMARY: Errors in \x{...} and \N{...} are handled poorly, leading
         to mysteriously inconsistent and incorrect results.  Also,
         there are parsing bugs in \N{....} that permit all kinds of
         stuff that they shouldn't.
         
This seems wrong:

    $ blead -wle 'print length "\x{}"'
    1
    $ blead -wle 'print ord "\x{}"'
    0

The answer *should* be that that is altogether illegal.  The stupid
thing isn't even giving me a warning.  What's that about?

Contrast with this much more reasonable behavior:

    $ blead -wle 'print ord "\o{}"'
    Number with no digits at -e line 1, within string
    Execution of -e aborted due to compilation errors.

And I cannot see any justification for this:

    $ blead -le 'print $n = () = "8\x{00}8" =~ /\A[\8]{3}\z/'
    1
    $ blead -le 'print $n = () = "g\x{00}g" =~ /\A[\xg]{3}\z/'
    1

Yes, those will generate optional ignorable warnings if you beg it nicely,
but they're *syntax errors*.  It doesn't seem right to emit optional and
by-default-off warnings on syntax errors, for goodness' sake!  The same
thing should happen as with a bad \o{} -- it should die.

This is a poor message, for several reasons:

    % blead -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{}<-- HERE  at -e line 1.
    FAIL

First, that looks a great deal like a syntax error to me.  Why isn't
it marked as such and the whole thing dying?  It's not just that it
couldn't find a character name.  It's that it wasn't even given one.
That should be an error.

There are other weirdnesses.  I understand what's going on, but it's
a bit surprising.

    % blead -le 'print "." =~ /\N{$a}/ || "FAIL"'
    Unknown charname '$a' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{$<-- HERE a} at -e line 1.
    FAIL

    % blead -le '$a = 1; print "." =~ /\N{$a}/ || "FAIL"'
    Unknown charname '$a' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{$<-- HERE a} at -e line 1.
    Name "main::a" used only once: possible typo at -e line 1.
    FAIL

Ok, so variable interpolation doesn't happen in \N{brackets}.  Is that
actually documented?

And what sort of warning *is* that?  How do you control it?

This didn't turn it off:

    % blead -X -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1.
    FAIL

Nor did this:

    % blead -M-warnings -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1.
    FAIL

And yet this fatalized it:

    % blead -Mwarnings=FATAL,all -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{}<-- HERE  at -e line 1.
    Exit 255

And this missed it altogether:

    % perl -Mcharnames=:full -M-warnings -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1

    FAIL

So you can't turn it off in any of the normal ways.

It also seems to have an extra newline in the output.

Here diagnostics couldn't catch it:

    % blead -Mcharnames=:full -Mdiagnostics -le 'print "." =~ /\N{}/ || "FAIL"'
    Unknown charname '' at -e line 1

    Deprecated character in \N{...}; marked by <-- HERE  in \N{}<-- HERE  at -e
	    line 1 (#1)
	(D deprecated) Just about anything is legal for the ... in \N{...}.
	But starting in 5.12, non-reasonable ones that don't look like names
	are deprecated.  A reasonable name begins with an alphabetic character
	and continues with any combination of alphanumerics, dashes, spaces,
	parentheses or colons.
	
    FAIL

And there's all that extra newline business still.

That error message is misleading.  It looks like it's saying

    m{
	\A \p{Alphabetic} 
	[\p{Alphabetic}\p{Number}\p{Dash}\N{SPACE}():] +
	\z 
    }x


But that's incorrect.  It appears to actually be this:

    m{
	\A 
	(?= \p{word} )			# must start with a word character, that isn't digit or _
	(?! [\N{LOW LINE}\p{digit}] )   # but *does* admit other numerics and connector punctuation!!
	[\p{word}\N{HYPHEN-MINUS}\N{SPACE}():] *
	\z 
    }x

Which isn't what the message says.  And that star really should be a 
plus, as previously discussed.
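
To make the gap between the two readings concrete, here is a minimal sketch
that runs both patterns over a few candidate names.  The patterns are
transcribed from the two above, with SPACE, HYPHEN-MINUS and LOW LINE written
as plain escapes so that no charnames import is needed.  The helper names and
the sample names are hypothetical, chosen only for illustration, and the
second pattern is a reconstruction of the observed behavior, not the logic
toke.c actually uses.

```perl
use strict;
use warnings;

# Rule as the diagnostics text reads: an alphabetic character, then one or
# more alphanumerics, dashes, spaces, parentheses, or colons.
my $documented = qr{
    \A \p{Alphabetic}
    [\p{Alphabetic}\p{Number}\p{Dash}\x{20}():] +
    \z
}x;

# Rule as it appears to behave (a reconstruction, not the toke.c logic).
my $observed = qr{
    \A
    (?= \p{Word} )
    (?! [_\p{Digit}] )
    [\p{Word}\-\x{20}():] *
    \z
}x;

# Hypothetical helpers, named here only for illustration.
sub documented_ok { my ($name) = @_; return $name =~ $documented ? 1 : 0 }
sub observed_ok   { my ($name) = @_; return $name =~ $observed   ? 1 : 0 }

binmode STDOUT, ':encoding(UTF-8)';
# Samples: a real name, the empty name, '$a', UNDERTIE + X,
# ROMAN NUMERAL FOUR + X, LOW LINE + X, and a leading digit.
for my $name ("EURO SIGN", "", "\$a", "\x{203F}X", "\x{2163}X", "_X", "9X") {
    printf "%-12s documented=%d observed=%d\n",
        "'$name'", documented_ok($name), observed_ok($name);
}
```

The name that starts with U+203F UNDERTIE should pass the second pattern and
fail the first, which is the discrepancy being complained about here.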

Oh wait.  It turns out that this is what really happens:

    % setenv PERL_UNICODE SA

    % perl -Mcharnames=:full -le 'print qq(print "\\\N{X\N{EURO SIGN}}")'
    print "\N{X€}"
    % perl -Mcharnames=:full -le 'print qq(print "\\\N{X\N{EURO SIGN}}")' | blead -Mutf8
    Unknown charname 'X€' at - line 1.
    ?
    % perl -le 'print qq(print "\\\N{a\x{fffd}}")' | uniquote -v
    print "\N{a\N{REPLACEMENT CHARACTER}}"
    % perl -le 'print qq(print "\\\N{a\x{fffd}}")' | uniquote -x
    print "\N{a\x{FFFD}}"
    % perl -le 'print qq(print "\\\N{a\x{fffd}}")' | blead -Mutf8
    Unknown charname 'a�' at - line 1.
    % perl -le 'print qq(print "\\\N{a\x{fffd}##x%}")' | blead -Mutf8 -l
    Unknown charname 'a�##x%' at - line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{a�#<-- HERE #x%} at - line 1.
    ?
    % perl -Mcharnames=:full -le 'print qq(print "\\\N{X\N{EURO SIGN}\N{REPLACEMENT CHARACTER}\N{PESO SIGN}\N{ALIEN MONSTER}}")' | blead -Mutf8 -l
    Unknown charname 'X€�₱👾' at - line 1.
    ?
    % perl -Mcharnames=:full -le 'print qq(print "\\\N{X\N{EURO SIGN}\N{REPLACEMENT CHARACTER}\N{PESO SIGN}\N{ALIEN MONSTER}.}")' | blead -Mutf8 -l
    Unknown charname 'X€�₱👾.' at - line 1.
    Deprecated character in \N{...}; marked by <-- HERE  in \N{X€�₱👾.<-- HERE } at -
    line 1.
    ?

First, Perl is still double-encoding UTF-8 on stderr.  Is that ever going
to be fixed?

But most importantly, this is all messed up, and in all kinds of ways.  Once
you start with a word character that isn't a low line or a digit (but other
\p{Pc} and \pN code points are ok though, like U+203F UNDERTIE and the Roman
numerals), you can have anything you want -- *if* it's over Latin1!!  I've
tracked this down to a comment in toke.c, in the charname branch for source in
utf8, which reads:

     * ... We accept anything above the latin1
     * range because it is immaterial to Perl if it is
     * correct or not, and is expensive to check.

So apparently this is intentional.  Is this intentionally un(der)documented?

In any event, I'm surprised you can start with underties or Roman numerals. 
Yes, the Roman numerals are alphabetic, but the undertie seems strange.
I think we're using word characters but claiming alphanums.  

Actually, it seems like you can use any >Latin1 char you want:

    % blead -Mutf8 -le 'print ord "\N{⁶}"'
    Unknown charname '⁶' at -e line 1.
    65533

    % blead -Mutf8 -le 'print ord "\N{€}"'
    Unknown charname '€' at -e line 1.
    65533

    % blead -Mutf8 -le 'print ord "\N{👽}"'
    Unknown charname '👽' at -e line 1.
    65533

Surely the stderr double-encoding should go!  It's because of STDERR already
being in UTF8, because I have PERL_UNICODE=SA.  Here's the explicit override:

    % blead -C0 -Mutf8 -le 'print ord "\N{👽}"'
    Unknown charname '👽' at -e line 1.
    65533
    % blead -CS -Mutf8 -le 'print ord "\N{👽}"'
    Unknown charname '👽' at -e line 1.
    65533

That's not right.
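
For anyone unsure what double-encoding means here: bytes that are already
UTF-8 get treated as Latin-1 characters and encoded to UTF-8 a second time.
The sketch below reproduces that effect with the core Encode module.  It
illustrates the symptom only; it is an assumption about the general
mechanism, not a trace of what perl does internally when it writes the
message to STDERR.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $alien = "\x{1F47D}";                # U+1F47D EXTRATERRESTRIAL ALIEN
my $once  = encode('UTF-8', $alien);    # correct encoding: 4 bytes
my $twice = encode('UTF-8', $once);     # each of those bytes encoded again: 8 bytes

# Show the raw bytes in hex so the doubling is visible on any terminal.
printf "once:  %d bytes, hex %s\n", length($once),  unpack('H*', $once);
printf "twice: %d bytes, hex %s\n", length($twice), unpack('H*', $twice);
```

A terminal that expects UTF-8 would show the second form as a run of accented
Latin-1 characters instead of the alien.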

Hm, can I *use* a space alien for a name??

    % blead -C0 -Mutf8 -wle 'use charnames ":full", ":alias" => { "👽" => "DOLLAR SIGN" }; print ord "\N{👽}"'
    Unknown charname '👽' at -e line 1.
    65533

Apparently not.  But why didn't it warn me on the use charnames if it didn't
like the "👽" as a key?

Enough of that.  Moving right along...

Hm, why is a bogus property a syntax error, but a bogus charname isn't?

    % blead -Mcharnames=:full -le 'print "." =~ /\p{none such}/ || "FAIL"'
    Can't find Unicode property definition "none such" at -e line 1.
    Exit 255

And why is this so inconsistent?

    % blead -le 'print "." =~ /\x{DEADBEEF CAKE}/ || "FAIL"'
    FAIL

    % blead -le 'print "." =~ /\N{U+DEADBEEF CAKE}/ || "FAIL"'
    Invalid hexadecimal number in \N{U+...} at -e line 1, within pattern
    Execution of -e aborted due to compilation errors.
    Exit 255

Surely the first should produce the same result as the second: death.
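
As an illustration of the stricter behavior being asked for, here is a sketch
of a validator for the text between the braces of \x{...}.  The function name
strict_braced_hex and the wording of its messages are made up for this
example (the messages merely echo the \o{} and \N{U+...} diagnostics quoted
above); this is not how perl itself parses these escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the code point named by the body of \x{...},
# or dies when the body is empty or contains anything but hex digits.
sub strict_braced_hex {
    my ($body) = @_;
    die "Number with no digits in \\x{}\n" if $body eq '';
    die "Invalid hexadecimal number in \\x{$body}\n"
        unless $body =~ /\A[0-9A-Fa-f]+\z/;
    return hex $body;
}

# Try a valid body, the empty body, and the two malformed bodies from above.
for my $body ('41', '', 'DEADBEEF CAKE', 'g') {
    my $cp = eval { strict_braced_hex($body) };
    print defined $cp ? "\\x{$body} => $cp\n" : "\\x{$body} => error: $@";
}
```

Under this rule the empty body and both malformed bodies are rejected
outright, instead of quietly becoming U+0000 or a truncated number.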

This is also inconsistent:

    % perl -Mdiagnostics -wle 'print "." =~ /\x{DEADBEEF CAKE}/ || "FAIL"'
    Illegal hexadecimal digit ' ' ignored at -e line 1 (#1)
	(W digit) You may have tried to use a character other than 0 - 9 or
	A - F, a - f in a hexadecimal number.  Interpretation of the hexadecimal
	number stopped before the illegal character.
    
    FAIL

    % perl -Mdiagnostics -wle 'print "." =~ /\o{DEADBEEF CAKE}/ || "FAIL"'
    Non-octal character 'D'.  Resolved as "\o{}" at -e line 1 (#1)
	(W digit)  In parsing an octal numeric constant, a character was
	unexpectedly encountered that isn't octal.  The resulting value is as
	indicated.
    
    FAIL

And what's this about?

    % blead -wle 'print 0x - 1'
    -1

    % blead -wle 'print 0b - 1'
    -1

    % blead -wle 'print 0xx2'
    00

Surely that's again an error?  Those are the same kind of thing as this:

    % blead -wle 'print 3.14e'
    Bareword found where operator expected at -e line 1, near "3.14e"
	    (Missing operator before e?)
    Unquoted string "e" may clash with future reserved word at -e line 1.
    syntax error at -e line 1, near "3.14e
    "
    Execution of -e aborted due to compilation errors.
    Exit 255

--tom

-- 

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
   
  Platform:
    osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
    uname='openbsd chthon 4.4 generic#0 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E  -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lutil -lc
    perllibs=-lm -lutil -lc
    libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC  -L/usr/local/lib -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                        USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
    PERL_UNICODE="SA"
  @INC:
    /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/site_perl/5.14.0
    /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/5.14.0
    /usr/local/lib/perl5/site_perl/5.12.3
    /usr/local/lib/perl5/site_perl/5.11.3
    /usr/local/lib/perl5/site_perl/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    .



