RFC: $/="\R"; perl -0R

From: Karl Williamson
Date: July 15, 2013 15:36
Subject: RFC: $/="\R"; perl -0R
Message ID: 51E415AC.6020509@khwilliamson.com

Perl is in effective violation of a basic Unicode requirement, RL1.6, in
http://www.unicode.org/reports/tr18/#Line_Boundaries

============
Line Boundaries

To meet this requirement, if an implementation provides for 
line-boundary testing, it shall recognize not only CRLF, LF, CR, but 
also NEL (U+0085), PS (U+2029) and LS (U+2028).

Formfeed (U+000C) also normally indicates an end-of-line.
=============

(Note that tr18 is not actually a part of the Standard, but is written 
as if it were, and Perl has tried to follow it.)

A program can slurp in a whole file and use "split /\R/" as a 
work-around for not supporting this, but this may require too much memory.
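
A minimal sketch of that work-around, for illustration only. The helper name lines_via_slurp is hypothetical, and the filehandle is assumed to have been opened with a suitable decoding layer such as :encoding(UTF-8). The whole input sits in memory before any line is returned, which is the cost mentioned above.

```perl
use strict;
use warnings;

# Slurp everything from $fh, then split on any Unicode line boundary.
# The entire input is held in memory at once, which is the drawback.
sub lines_via_slurp {
    my ($fh) = @_;
    local $/;                  # undef means "read to end of file"
    my $text = <$fh>;
    return () unless defined $text;
    return split /\R/, $text;  # \R: CRLF, LF, VT, FF, CR, NEL, LS, PS
}
```

Note that split discards the terminators, unlike a normal read with $/, which leaves them attached to each record.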

I propose to allow $/ to be settable to a special value that indicates 
to use the official Unicode record separators.

On the command line, the special value 'R' following the -0 switch (-0R) 
would indicate this.  This would remove the possibility of ever having a 
separate -R option, since -0R could otherwise be read as -0 followed by -R.

Setting

	$/ = "\R";

would be the programmatic way of doing this.  This is already legal 
today, however: it sets the separator to a capital letter R and raises 
the warning "Unrecognized escape \R passed through".
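
To illustrate the current behaviour, here is a small sketch. The helper name records_split_on_R is hypothetical, and the warning is silenced only to keep the demonstration quiet. It shows that "\R" in a double-quoted string is today just the letter R, so records end at each capital R.

```perl
use strict;
use warnings;

# In a double-quoted string, \R is not a recognized escape, so it is
# passed through as a plain "R". The compile-time warning (category
# 'misc') is disabled for this one expression.
my $sep = do { no warnings 'misc'; "\R" };

# Read all records from $fh using the current meaning of $/ = "\R".
sub records_split_on_R {
    my ($fh) = @_;
    local $/ = $sep;
    my @records = <$fh>;
    return @records;
}
```

Any existing program that relies on this meaning would change behaviour under the proposal.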

I have looked at some of the code involved.  A user specifying this 
would see much slower input speeds, but that is much better than not 
being able to do it at all.
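
For comparison, the proposed behaviour can be approximated in pure Perl today without slurping. The following is only a sketch under stated assumptions, not how the core would implement it: make_unicode_line_reader and its chunk-size parameter are hypothetical, and the filehandle is assumed to carry a decoding layer. It reads fixed-size chunks and returns one record per call, taking care not to treat a CR at the end of a chunk as a terminator until it is known whether an LF follows.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that yields one record per
# call, each ending at any \R sequence (terminator kept), or undef at EOF.
sub make_unicode_line_reader {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my ($buf, $eof) = ('', 0);

    return sub {
        while (1) {
            if ($buf =~ /\R/) {
                my $end = $+[0];   # offset just past the terminator
                # A CR that ends the buffer may be the first half of a
                # CRLF whose LF has not been read yet.
                my $lone_cr_at_end = $end == length($buf)
                    && substr($buf, -1) eq "\r";
                return substr($buf, 0, $end, '')
                    unless $lone_cr_at_end && !$eof;
            }
            if ($eof) {
                return undef unless length $buf;
                return substr($buf, 0, length($buf), '');  # final partial record
            }
            my $n = read($fh, $buf, $chunk_size, length $buf);
            die "read failed: $!" unless defined $n;
            $eof = 1 if $n == 0;
        }
    };
}
```

A caller would loop with something like: my $next = make_unicode_line_reader($fh); while (defined(my $line = $next->())) { ... }.  The regex scan on every record gives a rough idea of why this kind of separator is slower than a fixed string.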

One argument against this is that it would lead to people demanding that 
we allow $/ to be set to arbitrary patterns.  But we would be doing this 
only to meet Unicode's specs, and so could easily resist those entreaties.
