perl.perl5.porters | Postings from December 2012

Re: [perl #109828] PerlIO::scalar does not handle UTF-8

Tony Cook
December 31, 2012 08:00
On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT
> <> wrote:
> > It should fail to open.  If you open a UTF8 flagged string for append
> > and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
> > string.
> >
> > Your patch as written ignores the principle that the SvUTF8() flag only
> > controls the internal encoding, not other behaviour.  If the SV contains
> > only code point 0xFF or lower we should downgrade it and work with that
> > rather than failing (or producing a warning).
> I didn't see enough consensus to change it that much, but I would be in favor.
> > This should also be done for _read() and _write(), since the SV can be
> > modified between I/O operations.
> >
> > There's an unrelated problem that _pushed() checks flags on both arg and
> > SvRV(arg) without calling SvGETMAGIC().
> It should just stop peeking and poking into the SV altogether, and use
> the proper APIs (sv_insert and friends). For that matter, I sometimes
> feel like it should be rewritten from scratch to actually make sense.
> Pretty much all of it is problematic.

I've attached my suggested changes (in several parts), also available
on as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning:

1) reading - to follow the "SVf_UTF8 is only representation"
principle, we'd need to downgrade where possible, so a \xA1 (for
example) in the stream is always treated as that byte, but this means
we have an inconsistency when the scalar cannot be downgraded - the
first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}"
would be different.
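
As a rough illustration of the reading inconsistency (a Python model of
the byte-level behaviour, not the PerlIO::scalar code; stream_bytes is a
hypothetical helper, and "downgrade" is modelled as Latin-1 encoding):

```python
def stream_bytes(text: str) -> bytes:
    """Bytes a reader would see under a downgrade-where-possible policy.

    Assumption: a string whose code points are all <= 0xFF is downgraded
    (one byte per character); anything else exposes its UTF-8 encoding.
    """
    try:
        return text.encode("latin-1")   # downgradable: one byte per char
    except UnicodeEncodeError:
        return text.encode("utf-8")     # not downgradable: UTF-8 bytes


a = stream_bytes("\xa1\x40")      # every code point fits in a byte
b = stream_bytes("\xa1\u0101")    # U+0101 prevents the downgrade

print(a.hex())  # a140
print(b.hex())  # c2a1c481
```

Both strings start with the same character, U+00A1, yet the first byte
read is 0xA1 in one case and 0xC2 in the other.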

2) writing - if the SV is flagged UTF8, and the user of the handle
doesn't write correct UTF8 data at the correct offsets, the SV will no
longer be properly formed utf-8, which I believe we're trying to
maintain.  One of my tests produced a warning about invalid UTF-8
before the fix was applied.
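
A minimal sketch of how such writes break the encoding, again modelled in
Python rather than taken from the patch (is_well_formed is a hypothetical
helper; the bytearray stands in for the buffer of a UTF8-flagged SV):

```python
def is_well_formed(buf: bytes) -> bool:
    """Return True if buf decodes as UTF-8."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Buffer of a string holding U+0101, stored as UTF-8 (bytes c4 81).
appended = bytearray("\u0101".encode("utf-8"))
appended += b"\xa1"            # append a raw non-UTF-8 byte

# Same starting buffer, but overwrite one byte in the middle of the
# two-byte sequence (a write at the "wrong" offset).
overwritten = bytearray("\u0101".encode("utf-8"))
overwritten[1:2] = b"A"

print(is_well_formed(bytes(appended)))     # False
print(is_well_formed(bytes(overwritten)))  # False
```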

It's possible this could be avoided if we always treat the written
bytes as code points and upgrade them when writing to a UTF8 string,
but then we run into a consistency issue vs reading - what happens when
a read on a UTF8 string reaches a code point > 0xFF?
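
The asymmetry can be sketched as follows (a Python model under the
assumption that "upgrade" means mapping each written byte to the code
point of the same value; write_upgraded and read_as_bytes are
hypothetical names, not functions in PerlIO::scalar):

```python
def write_upgraded(text: str, data: bytes) -> str:
    """Append bytes to a character string, one code point per byte."""
    return text + data.decode("latin-1")


def read_as_bytes(text: str) -> bytes:
    """Read the string back as bytes, one byte per code point.

    Raises UnicodeEncodeError for any code point above 0xFF, since
    there is no single byte that could represent it.
    """
    return text.encode("latin-1")


# Writing always works and keeps the string well formed.
upgraded = write_upgraded("\u0101", b"\xa1")

# Reading works only while every code point is <= 0xFF.
round_trip = read_as_bytes(write_upgraded("", b"\xa1"))

try:
    read_as_bytes(upgraded)
    read_failed = False
except UnicodeEncodeError:
    read_failed = True
```

Writes are total under this scheme, but reads are not: the string that
already contains U+0101 cannot be returned as a byte stream.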

As written, I think both the warning message and its documentation
could be improved.

