On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT
> <perlbug-followup@perl.org> wrote:
> > It should fail to open. If you open a UTF8 flagged string for append
> > and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
> > string.
> >
> > Your patch as written ignores the principle that the SvUTF8() flag only
> > controls the internal encoding, not other behaviour. If the SV contains
> > only code point 0xFF or lower we should downgrade it and work with that
> > rather than failing (or producing a warning).
>
> I didn't see enough consensus to change it that much, but I would be in favor.
>
> > This should also be done for _read() and _write(), since the SV can be
> > modified between I/O operations.
> >
> > There's an unrelated problem that _pushed() checks flags on both arg and
> > SvRV(arg) without calling SvGETMAGIC().
>
> It should just stop peeking and poking into the SV altogether, and use
> the proper APIs (sv_insert and friends). For that matter, I sometimes
> feel like it should be rewritten from scratch to actually make sense.
> Pretty much all of it is problematic.

I've attached my suggested changes (in several parts), also available
on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning:

1) reading - to follow the "SVf_UTF8 is only representation"
principle, we'd need to downgrade where possible, so a \xA1 (for
example) in the stream is always treated as that byte, but this means
we have an inconsistency when the scalar cannot be downgraded - the
first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}"
would be different.

2) writing - if the SV is flagged UTF8, and the user of the handle
doesn't write correct UTF8 data at the correct offsets, the SV will no
longer be properly formed UTF-8, which I believe we're trying to
maintain.

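
To make the two reasons above concrete, here is a minimal sketch in
Python. It is an analogy at the byte level, not the PerlIO::scalar
code: it assumes "downgrade" can be modelled as a Latin-1 encode and
the upgraded representation as a UTF-8 encode. The helper names
handle_bytes and is_valid_utf8 are hypothetical.

```python
def handle_bytes(s: str) -> bytes:
    """Bytes a byte-oriented reader would see under a
    'downgrade where possible' policy (assumed model, not Perl's code)."""
    try:
        # Every code point is <= 0xFF: the string can be "downgraded".
        return s.encode("latin-1")
    except UnicodeEncodeError:
        # At least one code point is > 0xFF: only the UTF-8 form exists.
        return s.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 1) Reading: both strings start with the character U+00A1, but the
#    first byte a reader sees depends on the rest of the string.
downgradable = "\xa1\x40"      # corresponds to "\xA1\x40"
wide = "\xa1\u0101"            # corresponds to "\xA1\x{101}"
first_downgradable = handle_bytes(downgradable)[0]
first_wide = handle_bytes(wide)[0]

# 2) Writing: overwrite one byte in the middle of a UTF-8 sequence.
buf = bytearray("\u0101".encode("utf-8"))   # two bytes: C4 81
buf[1:2] = b"A"                             # ordinary byte, wrong offset
still_valid = is_valid_utf8(bytes(buf))

print(hex(first_downgradable), hex(first_wide), still_valid)
```

Under these assumptions the first byte read is 0xA1 for the
downgradable string and 0xC2 for the one that cannot be downgraded,
and the overwritten buffer is no longer well-formed UTF-8.
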
One of my tests produced a warning about invalid UTF-8 before the fix
was applied. It's possible this could be avoided if we always treated
the written bytes as code points and upgraded them when writing to a
UTF8 string, but then we run into a consistency issue vs reading -
what happens when a read on a UTF8 string reaches a code point > 0xFF?

As written I think the warning message and the documentation of the
warning could both be improved.

Tony