
Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility

From: karl williamson
Date: December 20, 2010 18:16
Subject: Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
Message ID: 4D100DFF.6000202@khwilliamson.com

demerphq wrote:
> On 19 December 2010 17:39, Zefram <zefram@fysh.org> wrote:
>> karl williamson wrote:
>>>                    I also plan to remove the distinction of supers that
>>> don't fit into 31 bits from other supers.  I believe the size of a UV
>>> should be the determining factor in what is allowed.  A search of the
>>> p5p archives and git log did not turn up any reason for this distinction.
>> The reason is presumably that (real) UTF-8 can represent codepoints
>> up to 31 bits but not above.  Under the new paradigm, I think this
>> distinction should indeed not influence internal string processing.
>> The distinction is useful only when decoding/encoding UTF-8 for I/O.
>>
>>>              I know that Yves thinks that surrogates should warn when
>>> certain operations are performed on them; I'm not very opposed to that,
>>> but I don't think it's necessary,
>> I *am* very opposed to surrogate codepoints behaving differently from
>> non-surrogate codepoints under the allow-any-UV-codepoint paradigm.
> 
> Why? You can't just say "I'm opposed" without any justification. (Well
> you can - but don't expect it to be given any weight :-)
> 
> Why shouldn't perl warn when it tries to lc() a string containing a
> surrogate pair instead of the correctly decoded true codepoint the
> surrogate pair represents?
> 
> How is this any different from operations such as using non-numbers in
> mathematical operations? We default the non-number to 0, and warn. Why
> should lc() or other case aware operations be any different when
> operating on something that they simply should not be processing?
> 
> The perl rule is normally that we will provide the gun, and let you
> blow your foot off, but we warn that you are likely to need new shoes.
> Why do you think this one is "special" in this regard?
> 
> cheers,
> Yves
> 

I'm trying to make sense of all this.  No one is arguing that a 
surrogate pair read-in through UTF-16 should be other than the generated 
code point it represents.  Nor have I heard anyone argue that the 
default utf8 input layers should allow surrogates.  But there should be 
a way to specify that one wants to allow an isolated surrogate to be 
inputtable in utf8.

What people are saying, myself included, is that a surrogate stored 
inside a Perl scalar can be treated as any other non-assigned code 
point.  If adjacent to a surrogate of the opposite sex, it would not be 
treated as a surrogate pair.  Trying to convert it to UTF-16 would be an 
error.  Outputting it as UTF-8 would be a warning.  In earlier emails, I 
pointed out that the standard does say that surrogates can exist.  I'll 
repeat some of that text at the bottom of this message.  People want the 
ability to transfer and store arbitrary UVs using the principles of 
utf8, without them being necessarily Unicode.  I'm not averse to 
therefore warning on such characters when certain operations that give 
them Unicode semantics are performed, such as casing and /i matching. 
Zefram is; I'm not sure why.  I do believe that if above-Unicode code 
points don't warn, neither should surrogates, as above-Unicode code 
points have no Unicode semantics.
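
To make the model in the paragraph above concrete, here is a minimal sketch. It is written in Python rather than Perl, purely because Python strings can also hold isolated surrogates; it illustrates the general idea and says nothing about what Perl 5.14 does or will do. The helper name encodes_cleanly is made up for this example. Note that Python is stricter than what is proposed above: it refuses to write a lone surrogate as UTF-8 unless asked, where the proposal is only to warn.

```python
# Illustration in Python, not Perl: surrogates held inside a string.
lone = "\ud800"            # isolated high surrogate
pair = "\ud83d\ude00"      # high + low surrogate stored side by side

# Adjacent surrogates stay two separate code points inside the string;
# they are not merged into the supplementary character U+1F600.
pair_length = len(pair)
pair_is_merged = (pair == "\U0001F600")


def encodes_cleanly(text, encoding):
    """Hypothetical helper: True if text encodes under strict error handling."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Converting a lone surrogate to UTF-16 is an error.
strict_utf16 = encodes_cleanly(lone, "utf-16-le")
# Python also refuses strict UTF-8 output of a lone surrogate.
strict_utf8 = encodes_cleanly(lone, "utf-8")

# Explicit opt-in: the "surrogatepass" handler writes and reads the
# three-byte form, so the isolated surrogate round-trips.
raw = lone.encode("utf-8", "surrogatepass")
round_trip = raw.decode("utf-8", "surrogatepass")
```

The opt-in handler plays the role of the "way to specify that one wants to allow an isolated surrogate" mentioned earlier: the default stays strict, and the caller has to ask for the laxer behavior.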

And, as I've pointed out, the standard does give semantics to every code 
point it recognizes, including the surrogates, and they actively manage 
the surrogates' properties.  Here are all the differences between 
unassigned, non-characters, and surrogates: non-characters are simply 
unassigneds with an additional property, NonCharacter, being true 
(though some are in blocks).  Surrogates have the same properties 
as typical unassigneds with a few exceptions.  (Various unassigneds that 
are in, say, a Hebrew block, have Right-to-Left ordering, and aren't 
typical.  I'm talking about unassigneds whose ultimate purpose has yet 
to be determined.)  Surrogates have Gc=Cs instead of Gc=Cn, and they are 
in various blocks instead of Noblock, and tellingly, they have their own 
LineBreak property.  If Unicode didn't think there could be surrogates 
in strings, why did they make a special 'Surrogate' line break property? 
I'm pretty sure that every other property is the same.
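
The General_Category difference described above can be checked against any copy of the Unicode Character Database. The sketch below uses Python's standard unicodedata module as one convenient copy; the sample code points are my own choice, and the module does not expose the Block, Noncharacter_Code_Point, or Line_Break properties, so only Gc and the (absent) character name are shown.

```python
import unicodedata

# Sample code points, chosen for illustration only.
samples = {
    "high surrogate U+D800": "\ud800",
    "low surrogate U+DC00": "\udc00",
    "noncharacter U+FDD0": "\ufdd0",
    "noncharacter U+FFFF": "\uffff",
}

# General_Category: surrogates are Cs, noncharacters are Cn like any
# other unassigned code point.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

# None of these code points has a character name; passing a default
# avoids the ValueError that unicodedata.name() raises otherwise.
names = {label: unicodedata.name(ch, None) for label, ch in samples.items()}

for label in samples:
    print(f"{label}: Gc={categories[label]} name={names[label]}")
```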

It would be nice to be able to store any UV in utf8 without it 
necessarily being considered Unicode.  It does make sense though to warn 
when attempting to impute Unicode semantics to those characters that 
don't have any.  This would mean especially the above-Unicode code 
points.  It doesn't necessarily mean the surrogates, as Unicode does 
assign them semantics; and I believe that for all those operations that 
we can think of, those semantics are essentially inert.  They won't 
match any other character besides themselves under /i.  They don't 
change case, etc.  Thus allowing them to have their inert semantics 
doesn't do any harm, I believe; so I don't think warnings are necessary.
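
As a rough check of the "inert" claim, the following sketch again uses Python as a stand-in (not Perl, and not a statement about Perl's lc() or /i). It assumes Python's case mappings and its re module follow the Unicode data for these code points, under which a surrogate has no case mapping and no case-insensitive partner.

```python
import re

lone = "\udc00"  # isolated low surrogate

# Case operations leave the surrogate untouched.
case_inert = (
    lone.lower() == lone
    and lone.upper() == lone
    and lone.casefold() == lone
)

# Under case-insensitive matching it matches itself and nothing else
# from a small set of candidates (other surrogates and a cased letter).
pattern = re.compile(lone, re.IGNORECASE)
candidates = [lone, "\udc01", "\ud800", "A", "a"]
matched = [c for c in candidates if pattern.fullmatch(c)]
```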

Here is the standard's text again:

"A Unicode string data type is simply an ordered sequence of code units. 
Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, 
a Unicode 16-bit string is an ordered sequence of 16-bit code units, and 
a Unicode 32-bit string is an ordered sequence of 32-bit code units.

"Depending on the programming environment, a Unicode string may or may 
not be required to be in the corresponding Unicode encoding form. For 
example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, 
but are not necessarily well-formed UTF-16 sequences. In normal 
processing, it can be far more efficient to allow such strings to 
contain code unit sequences that are not well-formed UTF-16—that is, 
isolated surrogates.  Because strings are such a fundamental component 
of every program, checking for isolated surrogates in every operation 
that modifies strings can create significant overhead, especially 
because supplementary characters are extremely rare as a percentage of 
overall text in programs worldwide.

"It is straightforward to design basic string manipulation libraries 
that handle isolated surrogates in a consistent and straightforward 
manner. They cannot ever be interpreted as abstract characters, but they 
can be internally handled the same way as noncharacters where they 
occur. Typically they occur only ephemerally, such as in dealing with 
keyboard events. While an ideal protocol would allow keyboard events to 
contain complete strings, many allow only a single UTF-16 code unit per 
event. As a sequence of events is transmitted to the application, a 
string that is being built up by the application in response to those 
events may contain isolated surrogates at any particular point in time."

