
Re: [perl #107008] UTF8 patches for 5.16

Karl Williamson
March 24, 2012 14:28
On 03/24/2012 02:27 PM, Father Chrysostomos via RT wrote:
> On Sat Mar 24 12:40:24 2012, wrote:
>> On 03/23/2012 03:51 PM, Father Chrysostomos via RT wrote:
>>> Karl Williamson: You have some comments at the end of ticket perl #73022
>>> that imply that you are/were working on this bug.  Can you tell us what
>>> the status is?
>> What I was referring to was not the overall bug, but that Abigail
>> persuaded me that we should restrict the user-defined aliases in
>> "\N{...}" to begin with letters.  I did add code to toke.c to do this
>> (beginning in today's blead at line #3363).
>> However, that code doesn't check for above-Latin1 characters, as until
>> this patch is applied, it doesn't matter.  If we apply this patch in
>> 5.16, we need to revisit what we accept as characters in a name (I
>> personally have learned some things about Unicode since then, for
>> example, and also need to refresh my memory about this issue), and patch
>> this code as well.
>> The two-release deprecation cycle ends with 5.16, so that 5.18 can
>> actually forbid such names.
>> Given these reasons, I think it advisable to wait until 5.18 to fix this
>> bug.
>>> More importantly, can you tell us whether you think the
>>> patch attached (and at <>) is appropriate?
>> I haven't been following the design of these fixes, so I don't
>> understand (without more effort) how the patch works.  Otherwise, it
>> looks good to me; I imagine you would be qualified to immediately
>> determine if it looks like the similar patches that have been applied.
>> I would like a test added where the requested name has not been defined,
>> so that we could verify that the error message that gets output looks
>> sane with above-Latin1 characters.
> What the patch does is stop
> use utf8;
> $foo = "\N{ÿ}";
> from being interpreted as "\N{Ã¿}", where the former is \xff in a UTF-8
> source file, and the latter is the UTF-8 octet sequence for \xff
> interpreted as Latin-1.

That sounds reasonable.
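
To make the misinterpretation concrete, here is a minimal sketch that reproduces it outside the tokenizer. It uses only the core Encode module and does not exercise toke.c or \N{} itself; the variable names are mine, for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS) as a single character.
my $char = "\x{FF}";

# Its UTF-8 encoding is the two octets C3 BF.
my $octets = encode('UTF-8', $char);

# Reading those octets back as Latin-1 yields two characters,
# U+00C3 and U+00BF, instead of the original U+00FF.
my $misread = decode('ISO-8859-1', $octets);

for my $pair ([original => $char], [misread => $misread]) {
    my ($label, $str) = @$pair;
    my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $str;
    printf "%-9s %d character(s): %s\n", "$label:", length($str), "@code_points";
}
```

The second line of output shows the two-character name that \N{...} currently ends up looking for.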

> If there are to be more changes later to \N{...}, I don’t know that it’s
> so necessary to include this patch now.

I'm saying we *should not* include it until we have changed \N{} to
restrict the characters used in a name to legitimate ones.  Otherwise,
we have a backward-compatibility problem when we do make those changes.
Since none of these names can work now, there isn't an issue until
this patch is applied.

However, a simple change for 5.16 to accommodate this patch could be to
explicitly forbid all above-Latin1 characters.  Code currently exists
to check the Latin1 characters even in UTF-8 (though it may never have
been tested because of this bug).  In a later release we could relax
this requirement.

I'm willing to make and test this change if it is deemed desirable for 5.16.

The rule for Latin1 characters is that a name must begin with an
alphabetic character and contain only \w characters plus space,
no-break space, parentheses (because of existing Unicode names), and
colons (because the name could be of the form 'Greek: alpha').
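
As a rough illustration of that rule, combined with the stop-gap of forbidding above-Latin1 characters, the check could be sketched as below. This is not the toke.c code; name_looks_valid is a hypothetical helper, and the regex is only my reading of the rule as stated here.

```perl
use strict;
use warnings;

# Hypothetical helper, not the toke.c implementation: returns 1 if
# $name satisfies the rule as described in this message, else 0.
sub name_looks_valid {
    my ($name) = @_;

    # Proposed 5.16 stop-gap: reject any character above Latin1.
    return 0 if $name =~ /[^\x{00}-\x{FF}]/;

    # Must begin with an alphabetic character; the rest may be \w,
    # space, no-break space (U+00A0), parentheses, or colon.
    # /u asks for Unicode semantics so that Latin1 letters such as
    # U+00FF are treated as word characters in any string.
    return $name =~ /\A\p{Alphabetic}[\w \x{A0}():]*\z/u ? 1 : 0;
}

for my $name ('Greek: alpha', "\x{FF}", '1abc', "\x{3B1}") {
    my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $name;
    printf "%-40s %s\n", "@code_points",
        name_looks_valid($name) ? 'accepted' : 'rejected';
}
```

Doing the above-Latin1 test first keeps the two concerns separate, so relaxing it in a later release would only mean dropping that one line.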
