
Alarums and Excursions (was [perl #2783] Security of ARGV using 2-argument open)

From: Tom Christiansen
Date: July 27, 2008 18:59
Subject: Alarums and Excursions (was [perl #2783] Security of ARGV using 2-argument open)
Message ID: 12431.1217210364@chthon
In-Reply-To: Message from Zefram <zefram@fysh.org> 
   of "Sat, 26 Jul 2008 23:18:19 BST." <20080726221819.GA15269@fysh.org> 

> I think <> could be changed, however.  I'm OK with it continuing to
> process "-" as stdin; you're right that this is a common Unix
> convention.  But its handling of ">foo" and "rm -rf / |" are certainly
> not conventional.  I think (unlike /$/) that the intentional uses of
> those features are sufficiently rare that it's worth breaking them to
> make the operator less surprising for everyone else.

That, I'll get to later.

> I am mystified as to the circumstances under which one might actually
> want the behaviour of /$/ (without /m) or /^/m.  

Really?  I wonder why, as those are easily demonstrated.

For /(?-m:$)/, we have simply:

    while (<>) {
	last if /^END$/;
	...
    } 

Why?  Well, why should you chomp if you aren't going to use it anyway?

For /(?m:^)/, or for that matter, /(?-s:.)/, this is to aid dealing with
multiline records.

    $/ = q##;                      # empty string: paragraph mode
    while (<>) {
	if (/^STATE\s+(.*)$/m) {   # might be in the middle of the rec
	    $state = $1;
	}
    } 

It's a lot easier to type ^ and $ than \A and \Z I mean \z, and usually
these suffice.  Which is good, for otherwise we'd have to invert their
meanings and break all hope of fulfilling fair expectation.
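
To make the /$/ case concrete, here is a minimal sketch comparing $, \Z,
and \z on a line that may or may not still carry its newline.  The helper
name "anchors" is made up for illustration; it is not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: list which end-of-string anchors accept a line.
sub anchors {
    my ($line) = @_;
    return join ",",
        ($line =~ /^END$/  ? '$'  : ()),   # end, or just before a final newline
        ($line =~ /^END\Z/ ? '\Z' : ()),   # same as $ when /m is off
        ($line =~ /^END\z/ ? '\z' : ());   # only the very end of the string
}

print anchors("END\n"), "\n";   # unchomped line, as <> hands it over
print anchors("END"),   "\n";   # chomped line
```

The unchomped line is accepted by $ and \Z but not by \z, while the
chomped line is accepted by all three, which is why the loop above can
skip the chomp.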

> Certainly they can be correctly used, with a bit of care, but as far as
> I can see they never completely match the actual semantics of what
> constitutes a line start or end.

I'd say that you can't see very far, for to my eye, they do indeed.
It's not complicated, even though you seem to want it to be.

You want the traditional behavior of /$/ because /$/ in ed, sed, awk, vi,
lex, grep, etc has always anchored to the end of the line.  And those
logical lines do *not* end in a literal (hm, actually, it's virtual) \n
that's user-accessible.  So /foo$/ meant the foo at the end of the line.
In Perl, it still does--BY DESIGN.  Larry wanted people to be able to use
the same patterns they'd always used.  And he did, and they did, and all
were content with this for many years.

Maybe it's a Ken-thing (Ken Thompson, that is).  When Rob Pike wrote the
Plan 9 editor, sam, *Rob* sure wasn't content with it.  He specifically
wanted to let the user's patterns transcend mere line boundaries, and he
had to change a few ways of thinking to do so.  I no longer recall whether
these included . and $, but you could look it up.

There's a specific connection between all these elements of Perl:

    while (<>) {
	next if /^#/ || /SKIP$/;
	@fields = split;
	.....
    } 

That connection is that they are "newline-tolerant", not "newline-
sensitive".  They are liberal in what they consume, per the maxim. And
don't forget that easy things should be easy.

They have built-in conveniences both for domain-specific programming of the
filter variety, and also for those who forget to chop (now chomp), since in
previous programming endeavors using the shell, or sed, or awk, or even C's
deprecated gets(), the newline *WAS*NOT*STORED*.  And so people, especially
EXPERIENCED PEOPLE, *will* forget to chop or chomp.  

It was therefore a CYA DWIM convenience for this very scenario to have
split discard trailing null fields (after all, awk doesn't consider there
to be an extra null field after the newline) and to have /$/ permit one to
have one's newline and eat it, too.
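
A small sketch of the split half of that convenience, assuming nothing
beyond core Perl; the sample strings are invented for the example.

```perl
use strict;
use warnings;

my $line = "alpha beta gamma\n";      # unchomped, as read by <>

# Awk-style split on whitespace: the trailing newline adds no extra field.
my @fields = split " ", $line;
print scalar(@fields), "\n";

# With an explicit separator, trailing empty fields are dropped by default...
my @dropped = split /:/, "a:b:::";
# ...and are kept only when a negative limit asks for them.
my @kept    = split /:/, "a:b:::", -1;
print scalar(@dropped), " ", scalar(@kept), "\n";
```

The first split yields three fields, and the default split on ":" yields
two where the limit of -1 yields five.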

I'm not just making this up, either.  I specifically asked Larry on 
these historical matters of design decisions, and this was what he told me.

> I think these (<> and some of the regexp things) are unreasonably
> difficult to understand.  

While I see that you yourself have some difficulty, one cannot--and should
not--casually extrapolate one particular user's conceptual troubles to an
entire community or user-base.  

Perhaps it was not well-explained to you.  I don't know.  But I firmly
believe that programmers who'd rather write, or see written, this
sort of sequence:

    if (@ARGV == 0) { 
	@ARGV = ("-");
    }

ARGUMENT:
    while (@ARGV != 0) {
	$ARGV = shift(@ARGV);        
	$ARGV = "<&=STDIN" if $ARGV eq "-";
	# that's an fdopen(3S); use <&STDIN (or <&0) for dup2(2)
	unless (open(ARGV, $ARGV)) {
	    print STDERR "Can't open $ARGV: $!\n";
	    next ARGUMENT;
	}

LINE:   
	while (defined($line = readline(*ARGV))) {
	    if ($line =~ /^=for\s+(index|later)/) {
		next LINE;
	    } 
	    $chars = $chars + length($line);
	    $words = $words + split(" ", $line, 0);
	    $lines = $lines + ($line =~ tr[\n][]);
	}
    }

instead of just 

    while (<>) {
        next if /^=for\s+(index|later)/;
        $chars += length;
        $words += split;
        $lines += y/\n//;
    }

or even better

    #!/usr/bin/perl -n
    next if /^=for\s+(index|later)/;
    $chars += length;
    $words += split;
    $lines += y/\n//;

are few and far between.  They do not understand Perl's *spirit*, and
probably never shall.  Probably they've long-ago abandoned Perl (I sure
hope so, for all of our sakes' :-) due to their cognitive impedance
(fancy-talk for head-banging) with its philosophy: perhaps moving to
Python, perhaps moving to C, perhaps simply moving out of programming.

Yes, you'd have to add an END{} block for emitting the counts for
characters, words, and lines in the perl -n set-up, but big deal.  
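
For what it's worth, here is a rough sketch of that END block.  So that
it runs on its own, it reads from an in-memory filehandle with made-up
sample text rather than from <> or perl -n, and the output format is an
assumption.  It also assumes perl 5.12 or later, where split in scalar
context returns the field count without touching @_.

```perl
use strict;
use warnings;

my ($chars, $words, $lines) = (0, 0, 0);

# Stand-in for the files <> would read; the sample text is invented.
my $sample = "one two\n=for index skipped\nthree four five\n";
open(my $in, "<", \$sample) or die "can't open in-memory file: $!";

while (<$in>) {
    next if /^=for\s+(index|later)/;
    $chars += length;
    $words += split;
    $lines += y/\n//;
}

# Under perl -n the loop is implicit, so the report has to live in END.
END { printf "%7d %7d %7d\n", $lines, $words, $chars }
```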

I think this is a just demonstration of what Perl's input operator was
designed for, both for how it interacts with defaults on other functions,
operators, and operations and also for the synergistic convenience of these
working together--which, even if you choose to consider it a
domain-specific filter-like language, is still an admirable achievement.

The implicit versions are dramatically easier to maintain and understand --
assuming one has the least experience with the language.  Learn once, use
many.  You don't read a French novel without knowing French.

> /^/m is so difficult to understand that its
> own implementors have trouble with it.  

You think so?  I'd say that /\b$VAR/ or /$VAR\B/ is far harder to explain
to someone than m/^/m is.  But boundaries are always tough.  Whether $VAR
is "fred" or "+=" (or "cat's" vs "cats'"!) deeply changes what sort of
assertion \b is making, and that is a tough thing to explain.  Also, string
ends are always boundaries for a word character, no matter what, which
isn't quite what you for one might expect.

    % perl1      -e 'print "abc" =~ /c\b/ || 0, "\n";'
    1
    % perl4.036  -e 'print "abc" =~ /c\b/ || 0, "\n";'
    1
    % perl5.10.0 -e 'print "abc" =~ /c\b/ || 0, "\n";'
    1

And yes, that's newline-tolerant. :-)  Given that "c" is a /\w/ char,
you'd perhaps think that the \b might only mean that there must be a
/\W/ following it--but it doesn't.  This may even be similar after a
certain fashion to whatever it is you're grumping over m/^/m about.

This isn't true for the other flavor of \b, though.

    % perl1      -e 'print "===" =~ /=\b/ || 0, "\n";'
    0
    % perl4.036  -e 'print "===" =~ /=\b/ || 0, "\n";'
    0
    % perl5.10.0 -e 'print "===" =~ /=\b/ || 0, "\n";'
    0

> It was documented incorrectly in perlre for years, until I discovered
> the undocumented /(?!\z)/ bit of its behaviour and pointed it out (bug
> #27053, resolved by a documentation change in 5.10).

Undocumented?  Of that I'm not sure, and I couldn't see a unidiff that
showed a doc change.  But what's happening is far more simple than you make
it out to be (per usual).  Perl is merely being newline-tolerant again.
It's trying to follow what people are expecting to happen if they pulled
that string into their editor and set line numbers on.   And it does.

    % perl -le 'print scalar ( () = ("abc" =~ /^/mg) )'
    1
    % perl -le 'print scalar ( () = ("abc\n" =~ /^/mg) )'
    1

    % perl -le 'print scalar ( () = ("\nabc" =~ /^/mg) )'
    2
    % perl -le 'print scalar ( () = ("\nabc\n" =~ /^/mg) )'
    2
    % perl -le 'print scalar ( () = ("abc\ndef\n" =~ /^/mg) )'
    2
    % perl -le 'print scalar ( () = ("abc\ndef" =~ /^/mg) )'
    2

    % perl -le 'print scalar ( () = ("abc\ndef\n\n" =~ /^/mg) )'
    3

    % perl -le 'print scalar ( () = ("\nabc\n\ndef\n\n" =~ /^/mg) )'
    5
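
The counts above can be cross-checked against a long-hand spelling of
/^/m.  The pattern in $spelled is my assumption of an equivalent (start of
string, or just after a newline that is not the last character), and the
helper name "count_matches" is invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: count the matches of a pattern in a string.
sub count_matches {
    my ($re, $str) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

my $bol     = qr/^/m;
my $spelled = qr/(?:\A|(?<=\n)(?!\z))/;   # assumed long-hand equivalent

for my $s ("abc", "abc\n", "\nabc", "abc\ndef\n\n", "\nabc\n\ndef\n\n") {
    (my $shown = $s) =~ s/\n/\\n/g;       # make newlines visible
    printf "%-22s %d %d\n", $shown,
        count_matches($bol, $s), count_matches($spelled, $s);
}
```

If the assumption holds, the two columns agree on every row and repeat
the figures shown in the one-liners.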

> I understand these operators.  I will jump through the necessary hoops to
> write the program that I intend, even if the hoop is using a 20-character
> regexp (as in the table above) instead of the single character that
> I know from grep.  Not to put too fine a point on it, I'm unburned
> because I know the language inside out and I'm anal about correctness.

I'm uncertain you needed those last couple of words, and the earlier 
ones don't particularly help your case much, either--considering.

> I'm in a small minority on all three points.

Not to mention modesty, which is a good thing you didn't.

>> Users who don't read the document will always be surprised.

> Users who read the documentation for perl programs that use <> (such
> as our hypothetical wc-in-perl) generally don't get told about the
> magic meaning of "rm -rf / |" as an argument.  They will be surprised
> and burned by the present behaviour.

No, they won't; you're being absurd and alarmist.  But as I've had quite
enough of your ranting for the night, that means you get to wait until my
morrow, or more, to learn why you're wrong--and how.

Relish that interim.

--tom

-- 
    END { close(STDOUT) || die "can't close STDOUT: $!" }
