develooper Front page | perl.perl5.porters | Postings from July 2008

Re: cross-platform line-endings (was: Removing files called minus)

July 30, 2008 13:36
On Wed, Jul 30, 2008 at 02:12:26PM -0600, Tom Christiansen wrote:
> In-Reply-To: Message from Tels <> 
>    of "Wed, 30 Jul 2008 20:04:58 +0200." <> 
> > I'd like to inject here that chomp() is not enough to get rid of
> > Windows or Macintosh-style newlines.
> Well, it depends, but that's a good injection.
> > So you want:
> >       $_ =~ s/[\r\n]//g;      # or whatever
> If you're working on text-records, perhaps with $/ = q##, I think
> you might prefer:
>     s/[\r\n]+/ /g;
> So you still have a break between "foo\nbar" when it becomes "foo bar".
> The join algorithm in fmt or (n)vi (but not vim the dratted!) is
> much more clever about whether it should change newlines into q## or
> to q# #, depending on whether there's already a space or sentence-
> terminal punctuation there.  That way your ( and ) motion-targets
> still work.
> Larry mentioned last week how often he uses \v now; it's certainly
> useful. The new \v \V and \h and \H are rather nice, although I was
> a bit surprised that \h didn't include the backspace character, and
> that while "\t" is HT anywhere and "\b" is a BS in a string or
> charclass but not regex, "\v" is only a VT in a regex or charclass,
> not in a string.

That's because \v matches any vertical whitespace, not just a vertical tab:

    $ perl -Mcharnames=viacode -E 'for (0 .. 0xFFFF) {
      printf "0x%04X: %s\n", $_, charnames::viacode($_) if chr () =~ /[\v]/}'
    0x000A: LINE FEED (LF)
    0x000B: LINE TABULATION
    0x000C: FORM FEED (FF)
    0x000D: CARRIAGE RETURN (CR)
    0x0085: NEXT LINE (NEL)
    0x2028: LINE SEPARATOR
    0x2029: PARAGRAPH SEPARATOR

You can't use \v in strings for the same reason you cannot use \d.

As for \h not including a backspace, Perl follows the Unicode standard
which doesn't consider the backspace to be horizontal whitespace.
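
To make the difference between the regex escapes and string escapes concrete, here is a minimal sketch, assuming Perl 5.10 or later. The probe list and variable names are only illustrative; the script checks a handful of code points against \v and \h and shows what "\v" turns into inside a double-quoted string.

```perl
use strict;
use warnings;
use 5.010;

# A small, hand-picked set of code points to probe (illustrative only).
my @probe = (0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x2028, 0x2029);

# In a pattern, \v and \h are character classes, much like \d.
my @vertical   = grep { chr($_) =~ /\v/ } @probe;
my @horizontal = grep { chr($_) =~ /\h/ } @probe;

say '\v matches: ', join ' ', map { sprintf 'U+%04X', $_ } @vertical;
say '\h matches: ', join ' ', map { sprintf 'U+%04X', $_ } @horizontal;

# In a double-quoted string, \v is not a recognised escape, so the
# backslash is dropped and a plain "v" is left behind.
my $from_string;
{
    no warnings 'misc';    # silences "Unrecognized escape \v passed through"
    $from_string = "\v";
}
say 'string "\v" is the letter v: ', ($from_string eq 'v' ? 'yes' : 'no');

# To put an actual vertical tab in a string, spell it out numerically.
my $vt = "\x0B";
say 'string "\x0B" matches \v: ', ($vt =~ /\v/ ? 'yes' : 'no');
```

The backspace (U+0008) is in the probe list on purpose: it shows up in neither group.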

> Even more useful still is \R, I think, standing for "any return-
> sequence".
>    "\R" will atomically match a linebreak, including the
>    network line-ending "\x0D\x0A".  Specifically, it is exactly
>    equivalent to
>          (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
>    Note: "\R" has no special meaning inside of a character
>    class; use "\v" instead (vertical whitespace).
> It's a tiny pity that \v only works in patterns, though.  And
> I have to admit that I tend to think of \R as the simpler-written:
>      (?:\x0D\x0A|\v)
> Hm, now I *am* curious.  Why the no-backtracking (?>...|...) there?  
> Is it the | with 0x0A on both sides?  Would \x0D\x0A?* work as well?

No. \R will *not* match a single \x0D if it's followed by a \x0A. The 
Unicode standard says that in such a case, \R must match \x0D\x0A.

  $ perl -wE 'say "\x0D\x0A" =~ /\x0D\x0A?./s ? "match" : "no match"'
  match
  $ perl -wE 'say "\x0D\x0A" =~ /\R./s ? "match" : "no match"'
  no match
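
The same comparison can be extended to the simpler alternation quoted above. What follows is a minimal sketch, assuming Perl 5.10 or later, with variable names and sample records made up for illustration. It runs four patterns against a bare CR LF pair, then uses \R to trim whichever line ending a record happens to carry.

```perl
use strict;
use warnings;
use 5.010;

my $crlf = "\x0D\x0A";

# \R is atomic: once it has taken CR LF it will not give the LF back.
my $atomic  = $crlf =~ /\R./s               ? 'match' : 'no match';
# An optional LF can backtrack, so the dot gets the LF.
my $loose   = $crlf =~ /\x0D\x0A?./s        ? 'match' : 'no match';
# The plain alternation can fall back to \v matching the lone CR.
my $alt     = $crlf =~ /(?:\x0D\x0A|\v)./s  ? 'match' : 'no match';
# Wrapping the optional-LF form in (?>...) behaves like \R here.
my $wrapped = $crlf =~ /(?>\x0D\x0A?)./s    ? 'match' : 'no match';

say "\\R.              : $atomic";     # no match
say "\\x0D\\x0A?.       : $loose";     # match
say "(?:\\x0D\\x0A|\\v). : $alt";      # match
say "(?>\\x0D\\x0A?).   : $wrapped";   # no match

# Stripping one trailing line ending, whatever platform wrote it.
my @records = ("unix\x0A", "dos\x0D\x0A", "oldmac\x0D");
my @clean   = map { (my $copy = $_) =~ s/\R\z//; $copy } @records;
say join '|', @clean;                  # unix|dos|oldmac
```

So the non-backtracking group is what stops \R from ever splitting a CR LF pair, which the (?:\x0D\x0A|\v) form does not guarantee.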

Abigail