Re: [perl #38293: chr(65535) should be allowed in regexes; was Re: [perl #69414] Case-insensitive utf8 matching problem

From: demerphq
Date: October 5, 2009 14:06
Subject: Re: [perl #38293: chr(65535) should be allowed in regexes; was Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID: 9b18b3110910051358g7c34fc38l9d2c2236c8f823f@mail.gmail.com

2009/10/5 John Gardiner Myers <jgmyers@proofpoint.com>:
> karl williamson wrote:
>>
>> 3) code points above the legal Unicode maximum 10FFFF (which they have
>> recently reaffirmed will NEVER be exceeded (in 5.2, just released, 22%
>> of the available code points are assigned, up from 21% in 5.1)).
>>
>> 4) surrogate code points
>>
>> Case 3) could be construed as non-characters, but are somewhat
>> different because they aren't legal Unicode code points.  The message
>> could be reworded slightly to include them, as the principle is the same,
>> namely that these can successfully be used internally in an application, but
>> shouldn't be used for interchange with an unsuspecting application.  But
>> actually, I would prefer adding a new message for these, as there could be
>> less restriction on them, as there isn't the possibility of confusion with
>> BOM or other things.
>>
>> Case 4) has a separate message "UTF-16 surrogate 0x%04".  I think that
>> these actually could also be used internally in an application like the
>> others.  But this would definitely be an extension of Unicode, and require
>> some more work, and so I don't advocate it.
>
> Cases (3) and (4), when encoded in UTF-8, result in ill-formed code unit
> sequences (See definitions D92 and D93 in the Unicode Standard, version
> 5.2).  Generating such ill-formed code unit sequences violates conformance
> requirement C9 of the Unicode Standard.  Interpreting such ill-formed code
> unit sequences as characters violates conformance requirement C10 of the
> Unicode Standard.
>
> So Perl's existing behavior of merely warning on such "code points" does not
> conform to the Unicode Standard.

I think the language lawyers worked around this by saying that perl
internally does "utf8", which is basically "UTF-8" with some rules
relaxed and a larger range of "code points". IIUC, certain parts of
the system insist on operating on strict UTF-8 only. For example, to
the best of my knowledge, the conversion layer in Encode.pm provides
user-selectable levels of error trapping for such sequences.
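
A minimal sketch of the utf8 vs. UTF-8 distinction, assuming Encode's
lax "utf8" and strict "UTF-8" encoding names and its FB_CROAK check
level behave as documented (details may differ between Encode
versions). The helper names lax_hex and strict_accepts are made up for
illustration only:

```perl
use strict;
use warnings;
use Encode ();

# Silence perl's own surrogate / above-Unicode warnings for this demo.
no warnings 'utf8';

# Hypothetical helper: encode one code point with the lax "utf8"
# encoding and return the resulting octets as hex.
sub lax_hex {
    my ($cp) = @_;
    my $octets = Encode::encode('utf8', chr($cp));
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

# Hypothetical helper: true if the strict "UTF-8" encoding accepts the
# code point when asked to croak on anything it cannot map.
sub strict_accepts {
    my ($cp) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $ok = eval { Encode::encode('UTF-8', chr($cp), $check); 1 };
    return $ok ? 1 : 0;
}

# A surrogate (case 4) and a code point above 10FFFF (case 3).
for my $cp (0xD800, 0x110000) {
    printf "U+%04X  lax utf8: %-12s strict UTF-8: %s\n",
        $cp, lax_hex($cp),
        strict_accepts($cp) ? 'accepted' : 'rejected';
}
```

FB_CROAK is only one of the check levels; FB_WARN, FB_QUIET and the
default substitution behaviour are the other user-selectable options.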

Yves




-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
