On Fri, Nov 26, 2010 at 01:03:20PM -0800, Reverend Chip wrote:
> On 11/26/2010 2:25 AM, Nicholas Clark wrote:
> > On Fri, Nov 26, 2010 at 02:20:40AM -0800, Reverend Chip wrote:
> >> On 11/26/2010 1:23 AM, Nicholas Clark wrote:
> >>> Isn't the bug that perl let someone create an invalid data structure?
> >> That's an internally consistent position (no pun intended). But does
> >> the utf8 flag truly count as internal if manipulating it is both easy
> >> and well-documented for users?
> > easy (yes, too easy), documented (maybe, not well enough, particularly about
> > what it's about) and WRONG.
>
> You seriously equate Encode::_utf8_on() with, say, playing around with
> optrees using B? You seriously equate a bad pointer in an SV to a
> misplaced byte in a utf8 string?

Yes. Totally. It's documented as

    [INTERNAL] Turns on the UTF8 flag in STRING. The data in STRING is
    B<not> checked for being well-formed UTF-8. Do not use unless you
    B<know> that the STRING is well-formed UTF-8.

and the leading underscore is a convention too for "internal use".

I'd really prefer that it didn't exist at all.

> > (WRONG in the general case. It feels like an awful lot of end-user code to
> > deal with encodings is heuristics and bodgery, rather than actual
> > understanding)
>
> Very true, and a source of perpetual annoyance. But it's a separate
> issue, isn't it?

Not in my mind. Finding the need to resort to flipping the internal flag for
UTF-8 is a red flag that the proper conversion layer isn't implemented,
because the flow of data hasn't been thought about.

> >> As a separate matter, perhaps we can at least agree that assert() is an
> >> unfriendly thing for Perl to do in this case [...]
> > Where do you stop?
>
> Well, I wrote "in this case", so we would stop here. It would be a
> concession to usability based on manipulation of the utf8 flag being
> easy and documented (as you acknowledged).

No, I don't think that we'd stop "here", where "here" is that part of the
regexp engine.
To be sure that we don't SEGV or fail assertions anywhere in the codebase if
buffers are marked with SvUTF8() when they are not valid UTF-8, we'd have to
check AT EVERY PLACE that they are what they say they are. Which I don't think
is viable.

Nicholas Clark
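
A minimal sketch of the "proper conversion layer" mentioned above, as opposed to flipping the flag with Encode::_utf8_on(). It assumes the input arrives as raw octets that are supposed to be UTF-8, and validates them once at the boundary with Encode's decode() and FB_CROAK. The helper name decode_checked is hypothetical and does not come from this thread.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn octets into a character string, or return
# undef if the octets are not well-formed UTF-8.
sub decode_checked {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops it from modifying $octets in place.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text;
}

my $good = decode_checked("caf\xC3\xA9");   # well-formed UTF-8 octets
my $bad  = decode_checked("caf\xE9");       # Latin-1 octet, not valid UTF-8

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Unlike Encode::_utf8_on(), this checks the octets before anything is marked as UTF-8, so malformed data is rejected at the input boundary rather than reaching code that trusts SvUTF8().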