develooper Front page | perl.perl5.porters | Postings from January 2013

Re: [perl #109828] PerlIO::scalar does not handle UTF-8

Karl Williamson
January 14, 2013 03:37
On 12/31/2012 01:00 AM, Tony Cook wrote:
> On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
>> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT wrote:
>>> It should fail to open.  If you open a UTF8 flagged string for append
>>> and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
>>> string.
>>> Your patch as written ignores the principle that the SvUTF8() flag only
>>> controls the internal encoding, not other behaviour.  If the SV contains
>>> only code point 0xFF or lower we should downgrade it and work with that
>>> rather than failing (or producing a warning).
>> I didn't see enough consensus to change it that much, but I would be in favor.
>>> This should also be done for _read() and _write(), since the SV can be
>>> modified between I/O operations.
>>> There's an unrelated problem that _pushed() checks flags on both arg and
>>> SvRV(arg) without calling SvGETMAGIC().
>> It should just stop peeking and poking into the SV altogether, and use
>> the proper APIs (sv_insert and friends). For that matter, I sometimes
>> feel like it should be rewritten from scratch to actually make sense.
>> Pretty much all of it is problematic.
> I've attached my suggested changes (in several parts), also available
> on as tonyc/perlio-scalar-sanity.
> Reasons for failing instead of warning:
> 1) reading - to follow the "SVf_UTF8 is only representation"
> principle, we'd need to downgrade where possible, so a \xA1 (for
> example) in the stream is always treated as that byte, but this means
> we have an inconsistency when the scalar cannot be downgraded - the
> first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}"
> would be different.
> 2) writing - if the SV is flagged UTF8, and the user of the handle
> doesn't write correct UTF8 data at the correct offsets, the SV will no
> longer be properly formed utf-8, which I believe we're trying to
> maintain.  One of my tests produced a warning about invalid UTF-8
> before the fix was applied.
> It's possible this could be avoided if we always treat the written bytes as
> code points and upgrade them when writing to a UTF8 string, but then
> we run into a consistency issue vs reading - what happens when a read
> on a UTF8 string reaches a code point > 0xFF?
> As written I think the warning message could be improved and the
> documentation of the warning could be improved.
> Tony

Attached are some suggestions for wording changes.  I've never liked our
distinction between byte and character semantics.  It makes no sense to
me.  Everything is ultimately a byte.
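
For readers following the quoted discussion, the two problems Tony lists can be modelled in a few lines of Python. This is only a sketch under stated assumptions: Python's latin-1 and utf-8 codecs stand in for Perl's downgraded and upgraded internal representations, and stored_bytes and is_well_formed_utf8 are hypothetical helper names, not part of Perl or PerlIO::scalar.

```python
def stored_bytes(text: str) -> bytes:
    """Model of "downgrade where possible" (an assumption, not Perl's API):
    one byte per character when every code point is <= 0xFF,
    otherwise the UTF-8 encoding of the whole string."""
    try:
        return text.encode("latin-1")   # downgradable string
    except UnicodeEncodeError:
        return text.encode("utf-8")     # contains a code point > 0xFF


def is_well_formed_utf8(data: bytes) -> bool:
    """True when the byte buffer decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Point 1 (reading): both strings start with the character U+00A1,
# yet the first byte a reader would see differs.
downgradable = "\xa1\x40"      # every code point fits in one byte
wide = "\xa1\u0101"            # U+0101 forces the UTF-8 representation

print(hex(stored_bytes(downgradable)[0]))   # 0xa1
print(hex(stored_bytes(wide)[0]))           # 0xc2

# Point 2 (writing): appending a raw non-UTF-8 byte to a buffer that is
# supposed to hold UTF-8 leaves it malformed.
buffer = bytearray("\u0101".encode("utf-8"))
print(is_well_formed_utf8(bytes(buffer)))   # True
buffer += b"\xa1"                           # raw byte written through the handle
print(is_well_formed_utf8(bytes(buffer)))   # False
```

The first pair of prints shows why downgrading only when possible gives inconsistent reads; the second shows why an unchecked write to a UTF8-flagged buffer can break its encoding.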
