
Utf8 whitespace in numification treated differently to asciiwhitespace. ( Was: Newline legacy )

From: Kent Fredric
Date: August 29, 2012 05:22
Subject: Utf8 whitespace in numification treated differently to asciiwhitespace. ( Was: Newline legacy )
Message ID: CAATnKFDdy4Q3Mbgk+ABxoUXY74KGOu_Awp-FdBetOhRy20NeWg@mail.gmail.com

On 21 August 2012 07:32, Jan Dubois <jand@activestate.com> wrote:
> Given that even "1a" is considered to be the same as "1" (numerically),
> it totally makes sense that your constructed string is numerically equal
> as well.

I don't think this is strictly true:

PERL_STRICTURES_EXTRA=1 perl -Mstrictures -E 'my $i = 1; my $j = "1a";
if( $i == $j ) { say "Same" } else { say "Different" }'
Argument "1a" isn't numeric in numeric eq (==) at -e line 1.

But it can be slightly surprising that 1 == "   1\n"
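
To make the distinction concrete, here is a minimal sketch that separates "compares equal" from "compares equal without complaint". It uses plain `use warnings` rather than strictures, and `compare_to_one` is a hypothetical helper name, not anything from core:

```perl
use strict;
use warnings;

# Hypothetical helper: compare $str to 1 numerically and report both the
# result and how many warnings were raised along the way.
sub compare_to_one {
    my ($str) = @_;
    my @warnings;
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $equal = ( $str == 1 ) ? 1 : 0;
    return ( $equal, scalar @warnings );
}

for my $str ( "1", "1a", "   1\n" ) {
    my ( $equal, $warned ) = compare_to_one($str);
    ( my $shown = $str ) =~ s/\n/\\n/g;    # make the newline visible
    printf qq{"%s": equal=%d warned=%d\n}, $shown, $equal, $warned;
}
```

Both "1a" and "   1\n" compare equal to 1, but only "1a" raises the "isn't numeric" warning; the surrounding whitespace is accepted silently.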

I know this is a deviation of sorts from $TOPIC, but it seems
"sensible" to me that there would at least be a warnings category or
pragma that is slightly stricter with regard to numification.

The reason, in my mind, is that "some types of whitespace are
ignored" in numification seems sort of magical. And the question then
became "oh really, what types of whitespace are ignored"?

0x09 : Horizontal Tab
0x0A : Newline ( \n )
0x0C : Form Feed
0x0D : Carriage return ( \r )
0x20 : Space.

That sounds like a dandy starting list, which is perhaps useful if
you're only dealing with ASCII data.

However, utf8 ...

perl -Mutf8::all -Mwarnings=all -E 'binmode *STDOUT, q{:utf8}; for my
$i ( 0 .. 0xFFFF ){ next unless chr($i) =~ /\s/; say $i . q[ is] . ( (
( chr($i) . q{1} ) == 1 ) ? q[] : q[ not] ) . q[ numeric friendly
whitespace ]  }'
9 is numeric friendly whitespace
10 is numeric friendly whitespace
12 is numeric friendly whitespace
13 is numeric friendly whitespace
32 is numeric friendly whitespace
Argument "M-^E1" isn't numeric in numeric eq (==) at -e line 1.
133 is not numeric friendly whitespace
Argument "M- 1" isn't numeric in numeric eq (==) at -e line 1.
160 is not numeric friendly whitespace
Argument "\x{1680}1" isn't numeric in numeric eq (==) at -e line 1.
5760 is not numeric friendly whitespace
Argument "\x{180e}1" isn't numeric in numeric eq (==) at -e line 1.
6158 is not numeric friendly whitespace
Argument "\x{2000}1" isn't numeric in numeric eq (==) at -e line 1.
8192 is not numeric friendly whitespace
Argument "\x{2001}1" isn't numeric in numeric eq (==) at -e line 1.
8193 is not numeric friendly whitespace
Argument "\x{2002}1" isn't numeric in numeric eq (==) at -e line 1.
8194 is not numeric friendly whitespace
Argument "\x{2003}1" isn't numeric in numeric eq (==) at -e line 1.
8195 is not numeric friendly whitespace
Argument "\x{2004}1" isn't numeric in numeric eq (==) at -e line 1.
8196 is not numeric friendly whitespace
Argument "\x{2005}1" isn't numeric in numeric eq (==) at -e line 1.
8197 is not numeric friendly whitespace
Argument "\x{2006}1" isn't numeric in numeric eq (==) at -e line 1.
8198 is not numeric friendly whitespace
Argument "\x{2007}1" isn't numeric in numeric eq (==) at -e line 1.
8199 is not numeric friendly whitespace
Argument "\x{2008}1" isn't numeric in numeric eq (==) at -e line 1.
8200 is not numeric friendly whitespace
Argument "\x{2009}1" isn't numeric in numeric eq (==) at -e line 1.
8201 is not numeric friendly whitespace
Argument "\x{200a}1" isn't numeric in numeric eq (==) at -e line 1.
8202 is not numeric friendly whitespace
Argument "\x{2028}1" isn't numeric in numeric eq (==) at -e line 1.
8232 is not numeric friendly whitespace
Argument "\x{2029}1" isn't numeric in numeric eq (==) at -e line 1.
8233 is not numeric friendly whitespace
Argument "\x{202f}1" isn't numeric in numeric eq (==) at -e line 1.
8239 is not numeric friendly whitespace
Argument "\x{205f}1" isn't numeric in numeric eq (==) at -e line 1.
8287 is not numeric friendly whitespace
Argument "\x{3000}1" isn't numeric in numeric eq (==) at -e line 1.
12288 is not numeric friendly whitespace
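
For anyone who wants to repeat this without the one-liner and the stream of warnings, here is a rough, more readable sketch of the same scan. It assumes Perl 5.14 or later (for the `/u` regex modifier) and uses `Scalar::Util::looks_like_number` instead of `==` so that nothing warns; `classify_whitespace` is a hypothetical name. The exact lists can differ between Perl versions, since both the Unicode tables and the set of characters skipped during numification have changed over time.

```perl
use 5.014;
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: split the whitespace characters in 0 .. $max into
# those that are skipped when they precede a number and those that are not.
sub classify_whitespace {
    my ($max) = @_;
    my ( @friendly, @unfriendly );
    for my $cp ( 0 .. $max ) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        next unless chr($cp) =~ /\s/u;             # /u: Unicode rules for \s
        if ( looks_like_number( chr($cp) . '1' ) ) {
            push @friendly, $cp;
        }
        else {
            push @unfriendly, $cp;
        }
    }
    return ( \@friendly, \@unfriendly );
}

my ( $friendly, $unfriendly ) = classify_whitespace(0xFFFF);
printf "skipped:     %s\n", join ' ', map { sprintf 'U+%04X', $_ } @$friendly;
printf "not skipped: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @$unfriendly;
```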


I had somewhat expected 0xA0, non-breaking space, to be deemed 'whitespace'.

So essentially, some types of whitespace magically being ignored
during numification, and others not, strikes me as, at the very
least, an inconsistency, and at worst, hidden magical behaviour
that might bite somebody somewhere.

Granted, I don't have any real-world examples of why this
matters, just the feeling that "it could matter", and that if it does
matter, I probably won't know until too late, and won't have an easy
solution.

SUMMARISED:

chr(0x20) . 1   # a number
chr(0xA0) . 1 # not a number

Strange magical treatment of one type of whitespace as "ok" ==> unexpected.

Would like:
* warnings when this happens ( though I'm not sure if the warnings
should be opt-in, or opt-out by default )
* perhaps an extension to the UTF8 handling code so UTF8 whitespace is
also ignored ( which seems like a sane choice for the default
behaviour given by 'use utf8::all' at the very least )
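
In the meantime, the second item can be approximated in user code. This is only a sketch of the idea, not a proposal for how core should do it: `lenient_num` is a hypothetical helper that trims any Unicode whitespace from both ends before handing the string to the usual numification. It assumes Perl 5.14 or later for the `/u` modifier.

```perl
use 5.014;
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper, not a core function: trim Unicode whitespace from
# both ends of $str, then return its numeric value, or undef if what
# remains does not look like a number.
sub lenient_num {
    my ($str) = @_;
    return undef unless defined $str;
    ( my $trimmed = $str ) =~ s/\A\s+|\s+\z//gu;    # /u: Unicode rules for \s
    return looks_like_number($trimmed) ? $trimmed + 0 : undef;
}

for my $cp ( 0x20, 0xA0, 0x3000 ) {
    my $n = lenient_num( chr($cp) . '1' );
    printf "U+%04X . 1 => %s\n", $cp, defined $n ? $n : 'undef';
}
```

With this, chr(0x20) . 1 and chr(0xA0) . 1 both come out as 1, while something like "1a" is rejected outright rather than quietly truncated.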



-- 
Kent

perl -e  "print substr( \"edrgmaM  SPA NOcomil.ic\\@tfrken\", \$_ * 3,
3 ) for (9,8,0,7,1,6,5,4,3,2 );"


