Moin,
On Saturday 22 September 2007 23:55:20 Zefram wrote:
> # New Ticket Created by Zefram
[snip]
>
> $ perl -we '$a="require x\x{f1}y::z"; eval $a; print $@'
> Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
> Unrecognized character \xF1 at (eval 1) line 1.
> $ perl -we '$a="require x\x{f1}y::z"; utf8::upgrade($a); eval $a; print $@'
> Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl
> /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5
> /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8
> /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4
> /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
> $
>
> What I show above as "ZZ" was originally a sequence of two non-ASCII
> characters: U+00c3 (Latin capital letter A with tilde) and U+00b1
> (plus-minus sign). I've replaced them with ASCII characters to avoid
> unpredictable manglement.
The byte sequence C3 B1 is the UTF-8 encoding of character 0xF1 (U+00F1), so
that part is right.
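To make that concrete, here is a small sketch (variable names are mine, not from the ticket) showing that utf8::upgrade() changes only the internal storage of the string, and that the UTF-8 octets of U+00F1 are indeed C3 B1. Read as Latin-1, those two octets are U+00C3 and U+00B1, which matches the "ZZ" described above.

```perl
use strict;
use warnings;

# One character, U+00F1, stored as a single byte (no UTF-8 flag).
my $s = "\x{f1}";

# Same character, but stored internally as UTF-8.
my $upgraded = $s;
utf8::upgrade($upgraded);

# Replace the character by the octets of its UTF-8 encoding.
my $bytes = $s;
utf8::encode($bytes);

printf "equal as strings: %s\n", ($s eq $upgraded ? "yes" : "no");  # yes
printf "UTF-8 flag: %d vs %d\n",
    (utf8::is_utf8($s) ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);                             # 0 vs 1
printf "UTF-8 octets: %s\n", uc(unpack("H*", $bytes));              # C3B1
```

So the two strings Zefram passes to eval() compare equal; only the internal flag differs.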
> The phenomenon we see here is that the syntax of Perl, as judged by
> eval(), varies according to whether the input string is physically
> encoded in UTF8. If it is so encoded then U+00f1, Latin small letter N
> with tilde, is an acceptable identifier character, and so can be part
> of a module name. If not, then the very same character is invalid in
> that context and causes a syntax error.
>
> What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier
> character?
When you don't do "use utf8;", your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under "use utf8;",
it can contain any UTF-8.
However, it seems eval() (or require?) doesn't know about this. Plus, I am
not entirely sure how much Unicode you can use in identifiers as something
like this:
    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.
perldoc utf8 says:
    Enabling the "utf8" pragma has the following effect:

    Bytes in the source text that have their high-bit set will be
    treated as being part of a literal UTF-8 character. This
    includes most literals such as identifier names, string
    constants, and constant regular expression patterns.

But it doesn't seem to work in v5.8.8 at least.
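One thing that may matter here, and this is an assumption on my part rather than something the documentation quoted above states: if identifier characters under "use utf8;" have to be word characters (\w), then the euro sign would be rejected regardless, because it is a currency symbol and not a letter. A rough check of the two characters from this thread:

```perl
use strict;
use warnings;

# U+00F1 (n with tilde), upgraded so that \w applies Unicode rules to it.
my $n_tilde = "\x{f1}";
utf8::upgrade($n_tilde);

# U+20AC (euro sign); code points above 0xFF are always stored as UTF-8.
my $euro = "\x{20ac}";

my $n_is_word    = $n_tilde =~ /\A\w\z/ ? 1 : 0;
my $euro_is_word = $euro    =~ /\A\w\z/ ? 1 : 0;

printf "U+00F1 matches \\w: %d\n", $n_is_word;     # 1
printf "U+20AC matches \\w: %d\n", $euro_is_word;  # 0

# The UTF-8 encoding of the euro sign, for comparison with the error message.
my $octets = $euro;
utf8::encode($octets);
printf "U+20AC as UTF-8: %s\n", uc(unpack("H*", $octets));  # E282AC
```

The \x82 in the error message is consistent with the second octet of E2 82 AC, so the test script above may say more about the euro sign than about "use utf8;" in general.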
All the best,
Tels
--
Signed on Sun Sep 23 18:05:15 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
"Spammed if you do, spammed if you don't."
-- Murphy's Law