
cross-platform line-endings (was: Removing files called minus)

From: Tom Christiansen
Date: July 30, 2008 13:12
Subject: cross-platform line-endings (was: Removing files called minus)
Message ID: 5953.1217448746@chthon
In-Reply-To: Message from Tels <nospam-abuse@bloodgate.com> 
   of "Wed, 30 Jul 2008 20:04:58 +0200." <200807302005.05971@bloodgate.com> 

> I'd like to inject here that chomp() is not enough to get rid of
> Windows or Macintosh-style newlines.

Well, it depends, but that's a good injection.

> So you want:

>       $_ =~ s/[\r\n]//g;      # or whatever

If you're working on text-records, perhaps with $/ = q##, I think
you might prefer:

    s/[\r\n]+/ /g;

So you still have a break between "foo\nbar" when it becomes "foo bar".
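
To make both points concrete, here is a small sketch (the sample strings are made up for illustration). With the default $/ of "\n", chomp leaves the CR of a CRLF line behind, and the two substitutions differ in whether the words stay separated.

```perl
use strict;
use warnings;

# A DOS-style line, chomped while $/ is still the default "\n".
my $line = "foo\r\n";
chomp(my $chomped = $line);                 # removes only the trailing "\n"
my $cr_left = $chomped eq "foo\r" ? 1 : 0;

# A multi-line record, as you might get in paragraph mode.
my $record = "foo\r\nbar";
(my $deleted = $record) =~ s/[\r\n]//g;     # glues the words together
(my $joined  = $record) =~ s/[\r\n]+/ /g;   # keeps a break between them

print "CR left behind by chomp: $cr_left\n";
print "deleted: $deleted\n";
print "joined:  $joined\n";
```

Note that the second substitution also turns a record's trailing newline into a trailing space, so you may want to chomp first.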

The join algorithm in fmt or (n)vi (but not vim the dratted!) is
much more clever about whether it should change newlines into q## or
to q# #, depending on whether there's already a space or sentence-
terminal punctuation there.  That way your ( and ) motion-targets
still work.

Larry mentioned last week how often he uses \v now; it's certainly
useful. The new \v \V and \h and \H are rather nice, although I was
a bit surprised that \h didn't include the backspace character, and
that while "\t" is HT anywhere and "\b" is a BS in a string or
charclass but not regex, "\v" is only a VT in a regex or charclass,
not in a string.
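
A quick sketch of those escapes, assuming perl 5.10 or later; the variable names are only for illustration:

```perl
use strict;
use warnings;
use 5.010;    # \h and \v in patterns need perl 5.10 or later

my $bs = chr(8);     # BS, backspace
my $vt = chr(11);    # VT, vertical tab

my $bs_in_string    = "\b" eq $bs   ? 1 : 0;   # "\b" in a string is BS
my $bs_in_charclass = $bs =~ /[\b]/ ? 1 : 0;   # and in a character class
my $bs_is_hspace    = $bs =~ /\h/   ? 1 : 0;   # but \h does not include it
my $ht_is_hspace    = "\t" =~ /\h/  ? 1 : 0;   # HT is horizontal whitespace
my $vt_in_charclass = $vt =~ /[\v]/ ? 1 : 0;   # \v works in a character class

say "BS in string:    $bs_in_string";
say "BS in charclass: $bs_in_charclass";
say "BS matches \\h:   $bs_is_hspace";
say "HT matches \\h:   $ht_is_hspace";
say "VT in charclass: $vt_in_charclass";
```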

Even more useful still is \R, I think, standing for "any return-
sequence".

   "\R" will atomically match a linebreak, including the
   network line-ending "\x0D\x0A".  Specifically, is exactly
   equivalent to

         (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])

   Note: "\R" has no special meaning inside of a character
   class; use "\v" instead (vertical whitespace).

It's a tiny pity that \v only works in patterns, though.  And
I have to admit that I tend to think of \R as the simpler-written:

     (?:\x0D\x0A|\v)

Hm, now I *am* curious.  Why the no-backtracking (?>...|...) there?  
Is it the | with 0x0A on both sides?  Would \x0D\x0A?* work as well?
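
One observable difference between \R and the plain alternation, sketched with a made-up test string (this shows the behavior; it does not claim to be the designers' reason). Once \R has taken a CRLF it will not hand the LF back, whereas the non-atomic version can backtrack and match the CR alone.

```perl
use strict;
use warnings;
use 5.010;    # \R and \v need perl 5.10 or later

my $crlf = "\x0D\x0A";

# \R takes CRLF as one unit and will not give the LF back,
# so there is nothing left for the trailing \x0A to match.
my $atomic = $crlf =~ /^\R\x0A$/ ? 1 : 0;

# The plain alternation can backtrack: CRLF is tried first, the
# trailing \x0A fails, then \v matches the lone CR instead.
my $backtracking = $crlf =~ /^(?:\x0D\x0A|\v)\x0A$/ ? 1 : 0;

say "atomic \\R matched:         $atomic";
say "plain alternation matched: $backtracking";
```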

> to get really rid of them. Otherwise, the first time someone feeds
> your code a textfile that has been "converted" on Win32, your code
> will fail in interesting ways.

> Just one of these captchas that beginners only learn after painful
> experience :(

Hm.  Meaning "Captcha as in /(capture)/" || "Captcha as in Gotcha"?

Native files on their native systems aren't the problem, of course, 
due to the "\n" abstraction.  It's the alien ones you get where
cross-platform annoyances set in.  This is easily enough encountered,
though, whether from mail, samba, or NFS.

Somewhere I have a many-linked aaa2zzz program that looks at (aaa) 
and (zzz) in $0 to figure out which way you're going, then looks
those systems up in a nice hash to find the line-endings.
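
This is not that program, just a minimal sketch of the idea. The hash contents, the sub names, and the "dos2unix"-style naming are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical table; the system names and entries are illustrative only.
my %eol_for = (
    unix => "\n",
    dos  => "\r\n",
    mac  => "\r",      # classic Mac OS
);

# Work out the direction from a program name such as "dos2unix".
sub eols_from_name {
    my ($name) = @_;
    $name =~ s{.*/}{};    # strip any leading directory
    my ($from, $to) = $name =~ /^(\w+?)2(\w+)$/
        or die "can't parse '$name' as aaa2zzz\n";
    for my $sys ($from, $to) {
        exists $eol_for{$sys} or die "unknown system '$sys'\n";
    }
    return @eol_for{$from, $to};
}

# Replace the source system's line ending with the target's.
sub convert {
    my ($name, $text) = @_;
    my ($in, $out) = eols_from_name($name);
    $text =~ s/\Q$in\E/$out/g;
    return $text;
}

print convert("dos2unix", "one\r\ntwo\r\n");
```

In a many-linked script you would pass $0 as the name, so each link picks its own direction.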

I sure do run a lot of commands like this:

    perl    -pe 's/\n/\r\n/'    < README > README.txt
    perl -i -pe 's/\r//g'         README.txt
    perl -i -pe 'tr[\n\r][\r\n]'  Mac-README

Although now that I think about it, 

    perl -00 -i.pre-munge -pE 's/\R/\n/g' plaintextfile

might be better.  
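
A short sketch of why \R is attractive here, using a made-up string with all three conventions mixed together. One substitution handles input that none of the three single-purpose commands would clean up alone.

```perl
use strict;
use warnings;
use 5.010;    # \R needs perl 5.10 or later

# Unix, DOS, and classic Mac endings in one string.
my $mixed = "unix\ndos\r\nmac\rend\n";

(my $clean = $mixed) =~ s/\R/\n/g;    # every linebreak becomes "\n"

my @lines = split /\n/, $clean;
say scalar(@lines), " lines: @lines";
```

One caveat: \R also matches VT and FF, so a form feed in the text would become a newline too.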

And you can't set $/ to a regex, or else we could just use qr{\R}
and be done with it.  Instead, you have to sniff at the file a bit
to see what it feels like.  And even then, $/ only affects chomp and
readline, not . or ^ or $.
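
A small sketch of that last point, reading a made-up string through an in-memory filehandle. Setting $/ to CR makes readline and chomp cooperate, but the /m anchors still look only for "\n".

```perl
use strict;
use warnings;

my $text = "alpha\rbeta\r";    # classic Mac line endings

# Read the string as if it were a file, with $/ set to CR.
my @lines;
{
    local $/ = "\r";
    open my $fh, '<', \$text or die "can't open in-memory file: $!";
    chomp(@lines = <$fh>);     # readline and chomp both honor $/
    close $fh;
}

# The ^ anchor under /m ignores $/ and only recognizes "\n".
my $anchor_sees_beta = $text =~ /^beta/m ? 1 : 0;

print "lines read: @lines\n";
print "/^beta/m matched: $anchor_sees_beta\n";
```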

I'd rather like a line-discipline-sniffing module that did something
like /usr/bin/file does via /etc/magic, and lets you then binmode
and/or  $/-mangle appropriately.

Hey, if *that's* not a magic open (ie, using /etc/magic), I sure
dunno what is! :-)

I don't suppose anyone might know whether one exists already?
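
For what it's worth, the sniffing part could be sketched roughly like this. Nothing here is an existing module: sniff_eol, read_lines, the 4096-byte sample, and the majority-vote rule are all assumptions for illustration.

```perl
use strict;
use warnings;

# Guess a line ending from a sample of file contents.
# Hypothetical helper: counts each style and picks the most common.
sub sniff_eol {
    my ($sample) = @_;
    my $crlf = () = $sample =~ /\r\n/g;         # CR followed by LF
    my $cr   = () = $sample =~ /\r(?!\n)/g;     # CR on its own
    my $lf   = () = $sample =~ /(?<!\r)\n/g;    # LF on its own
    return "\r\n" if $crlf > 0 && $crlf >= $cr && $crlf >= $lf;
    return "\r"   if $cr > $lf;
    return "\n";                                # default when unsure
}

# Read a file's lines with $/ set from the sniffed sample.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $sample = '';
    defined read($fh, $sample, 4096) or die "can't read $path: $!";
    seek $fh, 0, 0 or die "can't rewind $path: $!";
    local $/ = sniff_eol($sample);
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

my %name = ("\r\n" => "CRLF", "\r" => "CR", "\n" => "LF");
print $name{ sniff_eol("one\r\ntwo\r\n") }, "\n";
```

A sample that ends in the middle of a CRLF would count one stray CR, which the vote normally absorbs; a real module would want to be more careful than this.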

--tom


    % perl -wE 'say "\v"'
    Unrecognized escape \v passed through at -e line 1.
    v

    % perl -wE 'say chr(11) =~ /\v/'
    1

    % perl -wE 'say "\n" =~ /\v/'
    1

    % perl -wE 'say "\r" =~ /\v/'
    1

    % perl -wE 'say chr(0x2029) =~ /\v/'
    1

    % perl -Mcharnames=:full -wE \
        'say "\N{PARAGRAPH SEPARATOR}" =~ /\v/'
    1

