
[perl #113856] Re: highly illegal variable names are now accidentally legal

From:
Brian Fraser
Date:
June 26, 2012 15:18
Subject:
[perl #113856] Re: highly illegal variable names are now accidentally legal
Message ID:
rt-3.6.HEAD-28836-1340749081-82.113856-75-0@perl.org
# New Ticket Created by  Brian Fraser 
# Please include the string:  [perl #113856]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=113856 >


On Wed, Jun 13, 2012 at 10:15 AM, Tom Christiansen <tchrist@perl.com> wrote:

> Call me a Luddite, but the following program is to my mind being very
> naughty -- which is the very *nicest* thing I can say about it.
>
>    use v5.16;
>    use utf8;
>
>    $— = "EM DASH";  # gc=Punctuation
>    say $—;
>
>    $ = "APPLE LOGO";  # Private Use Area
>    say $;
>
>    $£ = "POUND STERLING";  # gc=Symbol
>    say $£;
>
>    $­ = "SOFT HYPHEN" ; # gc=Control
>    say $­;
>
>    $ = "THIN SPACE";   # whitespace, can you believe it!?!?
>    say $ = "THIN SPACE";
>
>    $� = "HYPER 0x11_1111";  # trans-Unicode
>    say $�;
>
>    $� = "SURROGATE DC00"; # this should never be possible
>    say $�;
>
>    $̈̈ = "COMBINING DIARESIS";
>    say $̈̈ ;
>
>    $⃠  = "COMBINING ENCLOSING CIRCLE BACKSLASH";
>    say $⃠ ;
>
>    say "That’s all, folks!";
>
> Because it in fact compiles and runs.  Messily, yes, but it runs.
> I don't understand why it even compiles.
>
>    % ~/blead/perl -I ~/blead/lib /tmp/testu
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
>    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed.
>    EM DASH
>    APPLE LOGO
>    POUND STERLING
>    SOFT HYPHEN
>    THIN SPACE
>    HYPER 0x11_1111
>    SURROGATE DC00
>    COMBINING DIARESIS
>    COMBINING ENCLOSING CIRCLE BACKSLASH
>    That’s all, folks!
>
> % blead -v says that
>
>   This is perl 5, version 17, subversion 0 (v5.17.0-352-g3630f57) built for darwin-2level
>
> I can handle punctuation.  I can handle symbols.
>
> I can even handle private use area.
>
> I don't know what I think about hypers.  Probably yes ok.
>
> But I see no place for control characters or combining marks of any sort,
> and I am really unhappy about whitespace variable names.  What's next,
> dollar tab?  Beyond that, I am *exceedingly* displeased with surrogates.
> That's just evil and  wrong, and in so many ways.
>
> --tom
>
> Lest there be any question, here is a verbosely uniquoted version:
>
>     1
>     2  use v5.16;
>     3  use utf8;
>     4
>     5  $\N{EM DASH} = "EM DASH";  # gc=Punctuation
>     6  say $\N{EM DASH};
>     7
>     8  $\N{U+F8FF} = "APPLE LOGO";  # Private Use Area
>     9  say $\N{U+F8FF};
>    10
>    11  $\N{POUND SIGN} = "POUND STERLING";  # gc=Symbol
>    12  say $\N{POUND SIGN};
>    13
>    14  $\N{SOFT HYPHEN} = "SOFT HYPHEN" ; # gc=Control
>    15  say $\N{SOFT HYPHEN};
>    16
>    17  $\N{THIN SPACE} = "THIN SPACE";   # whitespace, can you believe it!?!?
>    18  say $\N{THIN SPACE} = "THIN SPACE";
>    19
>    20  $\N{U+111111} = "HYPER 0x11_1111";  # trans-Unicode
>    21  say $\N{U+111111};
>    22
>    23  $\N{U+DC00} = "SURROGATE DC00"; # this should never be possible
>    24  say $\N{U+DC00};
>    25
>    26  $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} = "COMBINING DIARESIS";
>    27  say $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} ;
>    28
>    29  $\N{COMBINING ENCLOSING CIRCLE BACKSLASH}  = "COMBINING ENCLOSING CIRCLE BACKSLASH";
>    30  say $\N{COMBINING ENCLOSING CIRCLE BACKSLASH} ;
>    31
>    32  say "That\N{RIGHT SINGLE QUOTATION MARK}s all, folks!";
>

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive
summary of the branch:

Length-one variables must match (?: (?=Word) [\p{XIDS}_] |
[\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of
whether 'use utf8;' is in effect, so $£ is now always illegal. Expanding
the rule to use some broader definition of punctuation/controls should be
simple, though; it's just a matter of changing one macro.
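
For anyone who wants to poke at the length-one rule without building the branch, here is a rough Perl-level sketch of it. This is only a model of the pattern quoted above, not the toke.c code: the helper name is_legal_single_char_name is made up for illustration, and it assumes that (?=Word) is shorthand for (?=\p{Word}) and that the POSIX_* names correspond to Perl's \p{PosixPunct}, \p{PosixDigit} and \p{PosixCntrl} properties.

```perl
use strict;
use warnings;

# Model of the length-one rule from the summary above.
# Assumption: "(?=Word)" is read as "(?=\p{Word})".
my $SINGLE_CHAR_NAME = qr/
    \A
    (?: (?=\p{Word}) [\p{XIDS}_]
      | [\p{PosixPunct}\p{PosixDigit}\p{PosixCntrl}]
    )
    \z
/x;

# Hypothetical helper (not in the branch): returns 1 if the single
# character $char would be accepted as a length-one variable name
# under this model, 0 otherwise.
sub is_legal_single_char_name {
    my ($char) = @_;
    return $char =~ $SINGLE_CHAR_NAME ? 1 : 0;
}

# A few of the characters from the original report, by code point.
for my $cp (0x61, 0x5F, 0xE0, 0x21, 0xA3, 0x2014, 0x2009, 0x308) {
    printf "U+%04X => %s\n", $cp,
        is_legal_single_char_name(chr $cp) ? "legal" : "illegal";
}
```

Under this model, ASCII punctuation, digits and controls stay legal, as do letters and underscore, while POUND SIGN, EM DASH, THIN SPACE and a lone COMBINING DIAERESIS are all rejected.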

As mentioned before, the valid characters in an identifier no longer vary
depending on 'use utf8', except for the obvious restriction that under 'no
utf8;' the characters belong solely to the Latin-1 range. pod/perldata.pod
has a section streamlining the rules. As a side effect, 'no utf8; use
strict; $à' now has to declare $à with my(), as well it should.

The branch also fixes a bug in word and identifier parsing, where ASCII
alphanumerics would be eaten up without checking whether the next character
matched \p{XIDC}. This led to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} working
in previous versions. But MIDDLE DOT is an XIDC character, so that is now
parsed as bareword( qq\N{MIDDLE DOT} ), bareword( test ), and then a stray
XIDC character on its own, which is a syntax error. To get the previous
behavior, you need a space before the delimiter, which is consistent with
how 'q mfoom' works.
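
The 'q mfoom' comparison can be checked with plain ASCII on any perl, with no branch needed. A small sketch (the variable names are just for illustration):

```perl
use strict;
use warnings;

# With a space after "q", the word character "m" is taken as the
# delimiter, so this is the string "foo".
my $spaced = q mfoom;

# Without the space, "qmfoom" is one bareword, which strict subs rejects
# at compile time, so the eval fails and returns undef.
my $unspaced_ok = eval(q{ use strict; my $s = qmfoom; 1 }) ? 1 : 0;

print "spaced:   '$spaced'\n";
print "unspaced: ", ($unspaced_ok ? "compiled" : "rejected"), "\n";
```

The point is that a delimiter which can be part of a word only works when whitespace separates it from the operator, and the branch makes MIDDLE DOT follow the same rule.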

Internally, three things might be sorta icky and really need someone to
look them over.
First, I changed the definition of isIDFIRST_lazy_if and
isALNUM_lazy_if to use isIDFIRST_L1(*s) and (isALNUMC_L1(*s) || *s == '_'),
respectively, if we aren't under UTF mode.
Second, to fix the "ASCII letters being consumed too early" bug above, I
had to turn around how scan_ident and scan_word work, by putting the UTF
case first. This probably leads to some slowdowns.
Third, I've changed several spots from using isALNUM_lazy_if to
isIDFIRST_lazy_if. This made sense to me at the time, but an extra pair
of eyes would be welcome.
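
To make the "consumed too early" bug concrete, here is a toy Perl model of the two scanning strategies. It is only an illustration: scan_word_old and scan_word_new are invented names, and the real scan_word in toke.c works on the C level and handles much more (package separators, for one).

```perl
use strict;
use warnings;

# Toy model of the old behavior: grab ASCII word characters and stop,
# without looking at whether the next character could continue the word.
sub scan_word_old {
    my ($src) = @_;
    my ($word) = $src =~ /\A([A-Za-z0-9_]+)/;
    return $word;
}

# Toy model of the fixed behavior: a word is an XIDS character (or "_")
# followed by as many XIDC characters as are available.
sub scan_word_new {
    my ($src) = @_;
    my ($word) = $src =~ /\A((?:\p{XIDS}|_)\p{XIDC}*)/;
    return $word;
}

# "qq", MIDDLE DOT (U+00B7), " test ", MIDDLE DOT
my $src = "qq\x{B7} test \x{B7}";

printf "old scan: %d chars\n", length scan_word_old($src);
printf "new scan: %d chars\n", length scan_word_new($src);
```

The old model stops after "qq", which leaves MIDDLE DOT free to act as a quote delimiter; the new one swallows MIDDLE DOT into the word because it matches \p{XIDC}, which is why the unspaced form turns into a bareword.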


