Re: cross-platform line-endings (was: Removing files called minus)
From: Abigail
Date: July 30, 2008 13:36
Subject: Re: cross-platform line-endings (was: Removing files called minus)
Message ID: 20080730203620.GD29536@almanda

On Wed, Jul 30, 2008 at 02:12:26PM -0600, Tom Christiansen wrote:
> In-Reply-To: Message from Tels <nospam-abuse@bloodgate.com>
> of "Wed, 30 Jul 2008 20:04:58 +0200." <200807302005.05971@bloodgate.com>
>
> > I'd like to inject here that chomp() is not enough to get rid of
> > Windows or Macintosh-style newlines.
>
> Well, it depends, but that's a good injection.
>
> > So you want:
>
> > $_ =~ s/[\r\n]//g; # or whatever
>
> If you're working on text-records, perhaps with $/ = q##, I think
> you might prefer:
>
> s/[\r\n]+/ /g;
>
> So you still have a break between "foo\nbar" when it becomes "foo bar".
>
> The join algorithm in fmt or (n)vi (but not vim the dratted!) is
> much more clever about whether it should change newlines into q## or
> to q# #, depending on whether there's already a space or sentence-
> terminal punctuation there. That way your ( and ) motion-targets
> still work.
>
> Larry mentioned last week how often he uses \v now; it's certainly
> useful. The new \v \V and \h and \H are rather nice, although I was
> a bit surprised that \h didn't include the backspace character, and
> that while "\t" is HT anywhere and "\b" is a BS in a string or
> charclass but not regex, "\v" is only a VT in a regex or charclass,
> not in a string.

That's because \v matches any vertical whitespace, not just a vertical tab:

    $ perl -Mcharnames=viacode -E 'for (0 .. 0xFFFF) {
        printf "0x%04X: %s\n", $_, charnames::viacode($_) if chr () =~ /[\v]/}'
    0x000A: LINE FEED (LF)
    0x000B: LINE TABULATION
    0x000C: FORM FEED (FF)
    0x000D: CARRIAGE RETURN (CR)
    0x0085: NEXT LINE (NEL)
    0x2028: LINE SEPARATOR
    0x2029: PARAGRAPH SEPARATOR
    $

You can't use \v in strings for the same reason you cannot use \d: each
names a set of characters to match, not one particular character.

As for \h not including a backspace, Perl follows the Unicode standard,
which doesn't consider the backspace to be horizontal whitespace.

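To make the class-versus-character point concrete, here is a minimal
sketch, assuming Perl 5.10 or later (where \v and \h exist). It only
scans the ASCII range, and the variable names are made up for
illustration:

```perl
use strict;
use warnings;
use 5.010;

# \v is a character class: several distinct code points satisfy it.
# Only the ASCII range is scanned here to keep the output short.
my @vertical = grep { chr($_) =~ /\v/ } 0 .. 0x7F;
printf "\\v matches 0x%02X\n", $_ for @vertical;

# In a double-quoted string "\b" is a backspace (0x08); \h does not match it.
my $bs_is_horizontal  = "\b" =~ /\h/ ? 1 : 0;

# "\t" is a horizontal tab, which \h does match.
my $tab_is_horizontal = "\t" =~ /\h/ ? 1 : 0;

print "backspace: $bs_is_horizontal, tab: $tab_is_horizontal\n";
```

Within ASCII this should list LF, VT, FF and CR for \v, and report
that \h accepts the tab but not the backspace.
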
> Even more useful still is \R, I think, standing for "any return-
> sequence".
>
> "\R" will atomically match a linebreak, including the
> network line-ending "\x0D\x0A". Specifically, it is exactly
> equivalent to
>
> (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
>
> Note: "\R" has no special meaning inside of a character
> class; use "\v" instead (vertical whitespace).
>
> It's a tiny pity that \v only works in patterns, though. And
> I have to admit that I tend to think of \R as the simpler-written:
>
> (?:\x0D\x0A|\v)
>
> Hm, now I *am* curious. Why the no-backtracking (?>...|...) there?
> Is it the | with 0x0A on both sides? Would \x0D\x0A?* work as well?

No. \R will *not* match a single \x0D if it's followed by a \x0A. The
Unicode standard says that in such a case, \R must match \x0D\x0A.

    $ perl -wE 'say "\x0D\x0A" =~ /\x0D\x0A?./s ? "match" : "no match"'
    match
    $ perl -wE 'say "\x0D\x0A" =~ /\R./s ? "match" : "no match"'
    no match
    $

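Coming back to the chomp() question that started this subthread, the
atomic behaviour of \R is what makes it convenient for stripping one
trailing line ending of any flavour. A rough sketch, assuming Perl 5.10
or later; chomp_any is a made-up helper name, not something from the
core or from this thread:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: remove a single trailing line ending, whether it
# is LF, CR LF, or a lone CR. \R consumes CR LF as one unit, so a DOS
# line ending never leaves a stray CR behind.
sub chomp_any {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line;
}

for my $raw ("unix\x0A", "dos\x0D\x0A", "oldmac\x0D") {
    printf "[%s]\n", chomp_any($raw);
}
```

Unlike s/[\r\n]//g, this only touches the end of the string, so line
breaks embedded in a multi-line record stay where they are.
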
Abigail