On 12/31/2012 01:00 AM, Tony Cook wrote:
> On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
>> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT
>> <perlbug-followup@perl.org> wrote:
>>> It should fail to open. If you open a UTF8 flagged string for append
>>> and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
>>> string.
>>>
>>> Your patch as written ignores the principle that the SvUTF8() flag only
>>> controls the internal encoding, not other behaviour. If the SV contains
>>> only code points 0xFF or lower we should downgrade it and work with that
>>> rather than failing (or producing a warning).
>>
>> I didn't see enough consensus to change it that much, but I would be in favor.
>>
>>> This should also be done for _read() and _write(), since the SV can be
>>> modified between I/O operations.
>>>
>>> There's an unrelated problem that _pushed() checks flags on both arg and
>>> SvRV(arg) without calling SvGETMAGIC().
>>
>> It should just stop peeking and poking into the SV altogether, and use
>> the proper APIs (sv_insert and friends). For that matter, I sometimes
>> feel like it should be rewritten from scratch to actually make sense.
>> Pretty much all of it is problematic.
>
> I've attached my suggested changes (in several parts), also available
> on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.
>
> Reasons for failing instead of warning:
>
> 1) reading - to follow the "SVf_UTF8 is only representation"
> principle, we'd need to downgrade where possible, so a \xA1 (for
> example) in the stream is always treated as that byte, but this means
> we have an inconsistency when the scalar cannot be downgraded - the
> first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}"
> would be different.
>
> 2) writing - if the SV is flagged UTF8, and the user of the handle
> doesn't write correct UTF8 data at the correct offsets, the SV will no
> longer be properly formed utf-8, which I believe we're trying to
> maintain. One of my tests produced a warning about invalid UTF-8
> before the fix was applied.
>
> It's possible this could be avoided if we always treat the written
> bytes as code points and upgrade them when writing to a UTF8 string,
> but then we run into a consistency issue vs reading - what happens
> when a read on a UTF8 string reaches a code point > 0xFF?
>
> As written I think the warning message could be improved and the
> documentation of the warning could be improved.
>
> Tony
>

Attached are some suggestions for wording changes.

I've never liked our distinction between byte and character semantics.
It makes no sense to me. Everything is ultimately a byte.
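
The inconsistency described in point 1 above can be made concrete with a small Perl sketch. It does not exercise PerlIO::scalar itself: first_byte_seen() is a hypothetical helper that only models a "downgrade where possible, otherwise expose the internal UTF-8 bytes" policy, using the core utf8::downgrade() and utf8::encode() functions. Under that assumption, the same leading character \xA1 yields a different first byte depending on what follows it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PerlIO::scalar. It models what a reader
# would see as the first byte under a "downgrade if possible" policy.
sub first_byte_seen {
    my ($str) = @_;    # operate on a copy so the caller's string is untouched

    # FAIL_OK = 1: return false instead of dying when the string contains
    # a code point above 0xFF and so cannot be downgraded.
    if (utf8::downgrade($str, 1)) {
        return ord substr($str, 0, 1);    # one byte per character
    }

    # Not downgradable: fall back to the UTF-8 encoded form, which is
    # what the internal buffer of an SvUTF8 string holds.
    utf8::encode($str);
    return ord substr($str, 0, 1);
}

printf "%02X\n", first_byte_seen("\xA1\x40");       # prints A1
printf "%02X\n", first_byte_seen("\xA1\x{101}");    # prints C2
```

U+00A1 is encoded in UTF-8 as the two bytes C2 A1, so the second string starts with C2 rather than A1. Failing the open for a scalar that cannot be downgraded avoids exposing that difference to the reader.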