Front page | perl.perl5.porters |
Postings from December 2010
Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
From:
karl williamson
Date:
December 20, 2010 18:16
Subject:
Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
Message ID:
4D100DFF.6000202@khwilliamson.com
demerphq wrote:
> On 19 December 2010 17:39, Zefram <zefram@fysh.org> wrote:
>> karl williamson wrote:
>>> I also plan to remove the distinction of supers that
>>> don't fit into 31 bits from other supers. I believe the size of a UV
>>> should be the determining factor in what is allowed. A search of the
>>> p5p archives and git log did not turn up any reason for this distinction.
>> The reason is presumably that (real) UTF-8 can represent codepoints
>> up to 31 bits but not above. Under the new paradigm, I think this
>> distinction should indeed not influence internal string processing.
>> The distinction is useful only when decoding/encoding UTF-8 for I/O.
>>
>>> I know that Yves thinks that surrogates should warn when
>>> certain operations are performed on them; I'm not very opposed to that,
>>> but I don't think it's necessary,
>> I *am* very opposed to surrogate codepoints behaving differently from
>> non-surrogate codepoints under the allow-any-UV-codepoint paradigm.
>
> Why? You can't just say "I'm opposed" without any justification. (Well
> you can - but don't expect it to be given any weight :-)
>
> Why shouldn't perl warn when it tries to lc() a string containing a
> surrogate pair instead of the correctly decoded true codepoint the
> surrogate pair represents?
>
> How is this any different from operations such as using non-numbers in
> mathematical operations? We default the non-number to 0, and warn. Why
> should lc() or other case aware operations be any different when
> operating on something that they simply should not be processing?
>
> The perl rule is normally that we will provide the gun, and let you
> blow your foot off, but we warn that you are likely to need new shoes.
> Why do you think this one is "special" in this regard?
>
> cheers,
> Yves
>
I'm trying to make sense of all this. No one is arguing that a
surrogate pair read in through UTF-16 should be anything other than the
code point it represents. Nor have I heard anyone argue that the
default utf8 input layers should allow surrogates. But there should be
a way to specify that one wants to allow an isolated surrogate to be
inputtable in utf8.
What people are saying, myself included, is that a surrogate stored
inside a Perl scalar can be treated as any other non-assigned code
point. If adjacent to a surrogate of the opposite sex, it would not be
treated as a surrogate pair. Trying to convert it to UTF-16 would be an
error. Outputting it as UTF-8 would be a warning. In earlier emails, I
pointed out that the standard does say that surrogates can exist. I'll
repeat some of that text at the bottom of this message. People want the
ability to transfer and store arbitrary UVs using the principles of
utf8, without them necessarily being Unicode. I'm not averse to
therefore warning on such characters when certain operations that give
them Unicode semantics are performed, such as casing and /i matching.
Zefram is; I'm not sure why. I do believe that if above-Unicode code
points don't warn, neither should surrogates, as above-Unicode code
points have no Unicode semantics.
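
As a rough illustration of the model described above, here is a minimal sketch in Python rather than Perl. It is not the proposed 5.14 behaviour, only an analogy in a language whose strings already store arbitrary code points. The helper name `try_encode` is hypothetical. Note one difference from the proposal: Python's strict UTF-8 codec rejects a surrogate outright instead of warning, and the `surrogatepass` error handler is the explicit opt-in.

```python
# Illustration only (Python, not Perl): a str holds code points as-is.
high, low = "\ud800", "\udc00"

# Adjacent surrogates stay two separate code points inside the string;
# they are not silently combined into one supplementary character.
adjacent = high + low


def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


print(len(adjacent))                                    # number of code points
print(try_encode(adjacent, "utf-16-le"))                # strict UTF-16 output
print(try_encode(adjacent, "utf-8"))                    # strict UTF-8 output
print(try_encode(adjacent, "utf-8", "surrogatepass"))   # explicit opt-in

# A genuine surrogate pair arriving as UTF-16 input does decode to the
# single code point it represents.
decoded = b"\x00\xd8\x00\xdc".decode("utf-16-le")
print(decoded == "\U00010000")
```

The point of the sketch is the separation of concerns: the string itself is indifferent to surrogates, and the strictness lives entirely in the encode and decode steps.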
And, as I've pointed out, the standard does give semantics to every code
point it recognizes, including the surrogates, and they actively manage
the surrogates' properties. Here are all the differences between
unassigned, non-characters, and surrogates: non-characters are simply
unassigneds with an additional property, NonCharacter, being true
(though some are in blocks). Surrogates have the same properties
as typical unassigneds with a few exceptions. (Various unassigneds that
are in, say, a Hebrew block, have Right-to-Left ordering, and aren't
typical. I'm talking about unassigneds whose ultimate purpose has yet
to be determined.) Surrogates have Gc=Cs instead of Gc=Cn, and they are
in various blocks instead of Noblock, and tellingly, they have their own
LineBreak property. If Unicode didn't think there could be surrogates
in strings, why did they make a special 'Surrogate' line break property?
I'm pretty sure that every other property is the same.
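
The General_Category difference mentioned above can be checked against any copy of the Unicode Character Database. The sketch below uses Python's `unicodedata` module purely as a convenient lookup tool, not as a statement about Perl's tables. It assumes U+0378 is still unassigned in the Unicode version bundled with the interpreter; the constant names are hypothetical.

```python
import unicodedata

SURROGATE = "\ud800"       # first high surrogate
NONCHARACTER = "\uffff"    # a noncharacter
UNASSIGNED = "\u0378"      # assumption: unassigned in the bundled Unicode version


def general_category(ch):
    """Look up the Unicode General_Category (Gc) of a single code point."""
    return unicodedata.category(ch)


for name, ch in [("surrogate", SURROGATE),
                 ("noncharacter", NONCHARACTER),
                 ("unassigned", UNASSIGNED)]:
    print(name, "U+%04X" % ord(ch), general_category(ch))
```

The `unicodedata` module does not expose the Block or Line_Break properties, so those two differences are not shown here.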
It would be nice to be able to store any UV in utf8 without it
necessarily being considered Unicode. It does make sense though to warn
when attempting to impute Unicode semantics to those characters that
don't have any. This would mean especially the above-Unicode code
points. It doesn't necessarily mean the surrogates, as Unicode does
assign them semantics; and I believe that for all those operations that
we can think of, those semantics are essentially inert. They won't
match any other character besides themselves under /i. They don't
change case, etc. Thus allowing them to have their inert semantics
doesn't do any harm, I believe; so I don't think warnings are necessary.
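
The "inert" claim can be sketched the same way. This is again Python used as an illustration of the Unicode-level behaviour, under the assumption that its case mappings follow the standard's data; it says nothing about what Perl's `lc()` or `/i` do in any given release. Both helper names are hypothetical.

```python
import re


def is_case_inert(ch):
    """True if lowercasing, uppercasing and case folding all leave ch unchanged."""
    return ch.lower() == ch and ch.upper() == ch and ch.casefold() == ch


def matches_ignoring_case(pattern_char, subject_char):
    """True if pattern_char matches subject_char under case-insensitive matching."""
    return re.fullmatch(pattern_char, subject_char, re.IGNORECASE) is not None


print(is_case_inert("\ud800"))                      # a surrogate
print(is_case_inert("a"))                           # an ordinary cased letter
print(matches_ignoring_case("\ud800", "\ud800"))    # surrogate against itself
print(matches_ignoring_case("\ud800", "\udc00"))    # surrogate against another
```

A surrogate behaves here like any other caseless code point: it maps to itself and matches only itself.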
Here is the standard's text again:
"A Unicode string data type is simply an ordered sequence of code units.
Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units,
a Unicode 16-bit string is an ordered sequence of 16-bit code units, and
a Unicode 32-bit string is an ordered sequence of 32-bit code units.
"Depending on the programming environment, a Unicode string may or may
not be required to be in the corresponding Unicode encoding form. For
example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings,
but are not necessarily well-formed UTF-16 sequences. In normal
processing, it can be far more efficient to allow such strings to
contain code unit sequences that are not well-formed UTF-16—that is,
isolated surrogates. Because strings are such a fundamental component
of every program, checking for isolated surrogates in every operation
that modifies strings can create significant overhead, especially
because supplementary characters are extremely rare as a percentage of
overall text in programs worldwide.
"It is straightforward to design basic string manipulation libraries
that handle isolated surrogates in a consistent and straightforward
manner. They cannot ever be interpreted as abstract characters, but they
can be internally handled the same way as noncharacters where they
occur. Typically they occur only ephemerally, such as in dealing with
keyboard events. While an ideal protocol would allow keyboard events to
contain complete strings, many allow only a single UTF-16 code unit per
event. As a sequence of events is transmitted to the application, a
string that is being built up by the application in response to those
events may contain isolated surrogates at any particular point in time."