RFC: $/="\R"; perl -0R
From: Karl Williamson
Date: July 15, 2013 15:36
Subject: RFC: $/="\R"; perl -0R
Message ID: 51E415AC.6020509@khwilliamson.com
Perl is in effective violation of a basic Unicode requirement, RL1.6, in
http://www.unicode.org/reports/tr18/#Line_Boundaries
============
Line Boundaries
To meet this requirement, if an implementation provides for
line-boundary testing, it shall recognize not only CRLF, LF, CR, but
also NEL (U+0085), PS (U+2029) and LS (U+2028).
Formfeed (U+000C) also normally indicates an end-of-line.
=============
(Note that tr18 is not actually a part of the Standard, but is written
as if it were, and Perl has tried to follow it.)
A program can slurp in a whole file and use "split /\R/" as a
work-around for not supporting this, but this may require too much memory.
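For illustration, the slurp-and-split work-around might look like the sketch below. The helper name lines_via_slurp and the sample data are made up for this example; the in-memory filehandle only stands in for a real file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read the whole handle, then split on \R.
# In a regex, \R (Perl 5.10+) matches CRLF as a unit, plus LF, CR, VT, FF,
# NEL (U+0085), LS (U+2028) and PS (U+2029).
sub lines_via_slurp {
    my ($fh) = @_;
    local $/;                       # undef $/ means "read everything at once"
    my $content = <$fh>;
    return () unless defined $content;
    return split /\R/, $content;    # the entire file is held in memory here
}

# Sample data with mixed terminators, encoded so it can be read back
# through an :encoding layer from an in-memory filehandle.
my $bytes = encode('UTF-8', "one\r\ntwo\x{2028}three\nfour\x{85}five");
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";

my @lines = lines_via_slurp($in);
close $in;

print join('|', @lines), "\n";      # prints one|two|three|four|five
```

The cost is the one noted above: $content holds the whole file before any line is available.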
I propose allowing $/ to be set to a special value that tells Perl to
use the official Unicode record separators.
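To make the intended behaviour concrete, here is a user-level approximation that returns one record at a time without slurping. It is only a sketch and not the proposed core change: make_line_reader and its chunk-size parameter are hypothetical names, and the tiny chunk size in the usage is chosen just to exercise a CRLF that straddles two reads.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns a closure that yields one line per call,
# with the terminator removed, or undef at end of input.
sub make_line_reader {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my $buf = '';
    my $eof = 0;
    return sub {
        while (1) {
            # Accept a terminator only if more data follows it (or we are at
            # EOF); a CR at the very end of the buffer may be half of a CRLF.
            if ($buf =~ /\A(.*?)\R/s && ($eof || $+[0] < length $buf)) {
                my $line = $1;
                substr($buf, 0, $+[0], '');   # drop the line and its terminator
                return $line;
            }
            if ($eof) {
                return undef unless length $buf;
                my $line = $buf;              # final line without a terminator
                $buf = '';
                return $line;
            }
            my $n = read($fh, $buf, $chunk_size, length $buf);
            die "read failed: $!" unless defined $n;
            $eof = 1 if $n == 0;
        }
    };
}

my $bytes = encode('UTF-8', "one\r\ntwo\x{2028}three\nfour\rfive");
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";

my $next_line = make_line_reader($in, 4);
my @lines;
while (defined(my $line = $next_line->())) {  # defined(): "" is a valid line
    push @lines, $line;
}
close $in;

print join('|', @lines), "\n";                # prints one|two|three|four|five
```

Memory use is bounded by the chunk size plus the longest line, at the price of running a regex over the buffer for every record.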
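On the command line, the special value 'R' following the -0 switch (-0R)
would indicate this. Because single-character switches can be clustered,
-0R could otherwise be read as -0 followed by -R, so this would remove
the possibility of ever having a -R option on the command line.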
Setting
$/ = "\R";
would be the programmatic way of doing this. This, however, is already
legal: it currently sets the separator to a capital letter R, and raises
the warning "Unrecognized escape \R passed through".
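The current behaviour can be checked with a short script such as this sketch. The string eval is used only so that the compile-time warning can be captured at run time; the sample text is made up.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In a double-quoted string, \R is not a known escape: the backslash is
# dropped and a warning is issued when the string is compiled.
my $sep = eval q{ "\R" };
die $@ if $@;

# Using that value as $/ splits the input after every capital R.
my $text = "fooRbar\n";
open my $in, '<', \$text or die "open failed: $!";
my @records = do { local $/ = $sep; <$in> };
close $in;

printf "separator: '%s' (length %d)\n", $sep, length $sep;
printf "records:   %d\n", scalar @records;
print  "warning:   $_" for @warnings;
```

Here @records holds "fooR" and "bar\n", so any existing code that really does want a literal R as its separator would change meaning under the proposal.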
I have looked a bit at the code involved. A user specifying this would
see much slower input, but that is much better than not being able to
do it at all.
One argument against this is that it would lead to people demanding that
we allow $/ to be set to arbitrary patterns. But we would be doing this
only to meet Unicode's specs, and so could easily resist those entreaties.