Re: We need to consider Unicode variable names for 5.16
From:
Brian Fraser
Date:
October 10, 2011 03:02
Subject:
Re: We need to consider Unicode variable names for 5.16
Message ID:
CA+nL+nbc1PTQEyzj1m7fY9HzmRSr-KRVbmHP+ePm0OiXi6Jsfw@mail.gmail.com
On Sun, Oct 9, 2011 at 12:40 AM, Karl Williamson <public@khwilliamson.com> wrote:
> On 10/06/2011 08:51 PM, Brian Fraser wrote:
>
> Basically, as of right now in blead, variables of length one match
>> (?&sigil) \p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really
>> want \p{Any}, which allows a whole bunch of problematic characters? If
>> not, what do we restrict it to?
>>
>
> I haven't thought about this very much, but I don't want it to get
> warnocked. At first blush, certainly code points with a category of C, M,
> or Z should be excluded.
>
>
Tom had proposed something like [\pC\pZ\p{BC=NSM}] in, uh, a thread that I
can't quite place right now. But to make things worse, keep in mind that
(?&sigil)(?=\C\z)\p{Latin_1} has been legal since forever, so long as you
aren't under a use utf8. And I'm certain that I've seen things like
$\N{POUND SIGN} used out there -- how do we deal with _that_ little wart?
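For illustration, here is a minimal sketch of how the [\pC\pZ\p{BC=NSM}] exclusion would classify a few candidate single-character names. The helper name is_excluded_name_char and the sample code points are assumptions made for this example; nothing here reflects what blead actually does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): true if a single character
# falls into the proposed exclusion set [\pC\pZ\p{BC=NSM}].
sub is_excluded_name_char {
    my ($char) = @_;
    return $char =~ /\A[\pC\pZ\p{BC=NSM}]\z/ ? 1 : 0;
}

# Sample code points chosen for this example only.
my @samples = (
    [ 0x00A3, 'POUND SIGN (gc=Sc)' ],
    [ 0x0301, 'COMBINING ACUTE ACCENT (gc=Mn, bc=NSM)' ],
    [ 0x00A0, 'NO-BREAK SPACE (gc=Zs)' ],
    [ 0x200B, 'ZERO WIDTH SPACE (gc=Cf)' ],
    [ 0x0041, 'LATIN CAPITAL LETTER A (gc=Lu)' ],
);

for my $sample (@samples) {
    my ($cp, $label) = @$sample;
    printf "U+%04X %-40s %s\n", $cp, $label,
        is_excluded_name_char(chr $cp) ? 'excluded' : 'allowed';
}
```

Under this class the pound sign stays legal, so the $\N{POUND SIGN} case would keep working, while combining marks, spaces and format characters are rejected.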
> But now that we are close to having utf8 actually work in source code, we
> should think about identifiers in general, not just single character ones.
> This is actually a remarkably complicated issue, discussed at length in
> http://www.unicode.org/reports/tr31/
>
>
The outcome of this thread should be enough to get R1 conformance, if
anything. I don't think we want R1a at all; meanwhile, R1b is certainly
doable, and it might put some compatibility concerns to rest.
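As a point of reference, the default identifier shape in UAX#31 is one start character followed by any number of continue characters. A rough sketch using Perl's \p{XID_Start} and \p{XID_Continue} properties follows. The sub name is hypothetical, and the pattern deliberately ignores everything Perl layers on top (sigils, a leading underscore, package separators), so it is not a description of the tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper: does $name fit the plain UAX#31 default identifier
# shape, <XID_Start> <XID_Continue>*, with no Perl-specific additions?
sub looks_like_default_identifier {
    my ($name) = @_;
    return $name =~ /\A \p{XID_Start} \p{XID_Continue}* \z/x ? 1 : 0;
}

# Sample names chosen for this example only.
my @samples = (
    "Gordian_\x{2161}",   # Latin letters, underscore, ROMAN NUMERAL TWO
    "caf\x{E9}",          # Latin letters including e with acute
    "_private",           # leading underscore is not XID_Start
    "\x{A3}",             # POUND SIGN is a currency symbol
);

for my $name (@samples) {
    # Print code points rather than raw characters to keep output ASCII.
    my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $name;
    printf "%-50s %s\n", $shown,
        looks_like_default_identifier($name) ? 'matches' : 'does not match';
}
```

The last two samples show where a bare R1 definition and existing Perl practice part ways: both are accepted today in some form, and neither fits the default syntax without a profile.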
> There are two very different approaches discussed there. There are
> security concerns, which are alluded to in the document; for example do we
> allow identifiers to contain characters from multiple scripts?
That boat has unfortunately already sailed. Perl has allowed basically any
combination of scripts that comes to mind for a while, and it seems silly to
disallow things like $Gordian_\N{ROMAN NUMERAL TWO}. Admittedly there are
recommended combinations in UTR#36, but I'm of the mind that something like
this belongs more in Perl::Critic than in the core.
Also, as it stands, I don't think there's anything that can be done for RTL
scripts; they'll have to cope with being displayed completely wrong. Maybe
even add a "Bidi control characters not allowed outside of string
literals"-style error, to make it explicit.
I imagine a source filter + PerlIO::via layer combo might be enough to give
the impression of at least handling them, if there's interest in that.
> What about normalization, etc.
>
>
https://rt.perl.org:443/rt3//Public/Bug/Display.html?id=96814
I had a branch somewhere that implemented a very simple pragma (use
normalized identifiers => "NFD"), which started off pretty well, but I
quickly got stumped as I couldn't get direct stash manipulation right in all
cases.
That "worked" (for some definitions of working, of course :P) by calling
Unicode::Normalize::normalize, but a final implementation should probably
inline some normalization form into the core, and use
Unicode::Normalize::normalize as a fallback.
But in any case, opt-in normalization, and worse, opt-in normalization where
you pick the form, has some serious drawbacks. If a module picks NFKC and
exports a function into a package where I picked NFD, then what? Actually,
it's probably worse than that! If package main has one form and the first
part of your package can be normalized in different ways, you might end up
with a ghost package of sorts -- literals inside your package and what's in
%:: would differ. No clue what would happen then.
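To make the mismatch concrete, here is a small sketch using Unicode::Normalize::normalize. The helper name and the sample identifiers are assumptions for illustration; the strings stand in for symbol names, and no actual stash is touched.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

# Hypothetical helper: do two spellings name the same identifier once
# both are put through the given normalization form?
sub same_identifier_under {
    my ($form, $left, $right) = @_;
    return normalize($form, $left) eq normalize($form, $right) ? 1 : 0;
}

my $composed   = "caf\x{E9}";     # e with acute as a single code point
my $decomposed = "cafe\x{301}";   # e followed by COMBINING ACUTE ACCENT
my $ligature   = "\x{FB01}le";    # LATIN SMALL LIGATURE FI, then "le"

# Without normalization the two spellings are distinct names.
printf "raw  composed vs decomposed: %s\n",
    $composed eq $decomposed ? 'same' : 'different';

# Any single form reconciles composed and decomposed spellings.
printf "NFD  composed vs decomposed: %s\n",
    same_identifier_under('NFD', $composed, $decomposed) ? 'same' : 'different';

# The forms disagree about compatibility characters, which is where a
# module using NFKC and a caller using NFD would stop seeing each other.
for my $form (qw(NFD NFKC)) {
    printf "%-4s ligature vs 'file':      %s\n", $form,
        same_identifier_under($form, $ligature, 'file') ? 'same' : 'different';
}
```

A function exported under the NFKC spelling would be installed as "file", while a caller normalizing with NFD would still be looking up the ligature form.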
Maybe defer the choosing of the normalization form to the calling packages,
and pick a default if there isn't any? Not that that isn't troublesome, but
at this point I'm just grasping at straws.
"...This is hard, let's go shopping."
I wonder if having hooks into glob/lexical creation/fetching isn't a
completely insane idea; you could then force some of these issues into a
module, as well as installing the machinery required to get UAX#31 R4
through R7 conformance through CPAN.
We need to have a plan and some restrictions before 5.16 is released.
>