* Craig A. Berry <craig.a.berry@gmail.com> [2014-03-25 18:45]:
> If it's as broken by design as you imply, I wonder why it works.

It seems to, because if you try to output a string containing
"\x{100}" or beyond (which is impossible to represent in a single
byte), perl will try to keep trucking instead of just throwing an
error, and in the absence of anything better to do, it outputs the
UTF-8 representation of the string. I.e. "\x80\x{20AC}" will be
output as 2 + 3 bytes.

But if your string does not contain any such characters, then no
matter whether it has UTF8=on or UTF8=off, any "\x80" thru "\xFF"
characters in it will always be output as single bytes. I.e.
"\x80\xFF" will be output as 1 + 1 bytes. Which are not going to be
valid UTF-8.

There is a class of broken Perl programs which, under this
arrangement, happen to be broken in just the right way that the
breakage is rarely or possibly even never apparent. (But if you
think your code falls in the “never” category, you are almost
certainly mistaken.)

> Everything is read in raw and written out raw *except* the message
> body, which may have been generated by any random editor on any
> platform. For that I attempt to infer the locale encoding and read
> into a handle decoding that specific encoding (which may not be
> working right per Zefram though I haven't confirmed that yet).

That’s fine.

> Since I'm decoding on input, the data at that point have been
> converted to Perl's internal encoding, which as far as I know is
> UTF-8 (or a lax variant of it) when it needs to be. Then I write
> it out raw and say in the MIME header that it's UTF-8. Yes, I'm
> intentionally losing the encoding on the output handle. I pretty
> much have to since that handle may get data in multiple unknown
> encodings.

Using a raw handle is fine. Printing decoded strings to it is not:
you are missing the encoding step. You don’t need to put an encoding
layer on the handle to do that; you can encode explicitly
beforehand, just so long as you make sure to encode at some stage
before output.

This will turn a string that contains "\x{20AC}" into one that
contains the 3-character sequence "\xE2\x82\xAC" representing the
corresponding octets, which are what you want to output. These are
different strings, even if the bytes in the PV buffer are the same:
length() will return different values, what . will match in a
regexp will differ, etc.

Meanwhile the way that U+20AC is represented internally as a single
character should be no concern of yours: your code should not be
written so that it starts behaving differently if perl were to
switch its internal representation to UTF-16 (yeah, right) or
UTF-32/NFG (maybe someday?), say.

I don’t know if you do anything with the strings after they are
decoded. If not, you can just transcode them from the locale
encoding straight to the UTF-8 encoding using Encode.pm’s `from_to`
instead of decoding them and only encoding them later. That may or
may not be easier.

> I also don't specify character encodings in the MIME headers on
> the attachment(s) since there is no way to know what they are. But
> that's not a problem since the main use case is patches created by
> git-format-patch and git seems happy as long as we don't mangle
> anything in transport.

That’s fine.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
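A minimal demonstration of the output behaviour described at the top,
assuming a handle with no encoding layer (here, STDOUT set to raw):

    binmode STDOUT, ':raw';   # make sure there is no encoding layer

    print "\x80\x{20AC}";     # 5 bytes (C2 80 E2 82 AC), plus a
                              # "Wide character in print" warning

    print "\x80\xFF";         # 2 bytes (80 FF), whether the string's
                              # internal UTF8 flag is on or off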
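A minimal sketch of the explicit encoding step, assuming a raw output
handle in $out and a decoded message body in $body (both names are
hypothetical):

    use Encode qw(encode);

    # $body is a character string; turn it into UTF-8 octets
    # explicitly before it reaches the raw handle.
    print {$out} encode('UTF-8', $body);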
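And a small demonstration that the decoded string and its encoded
counterpart really are different strings:

    use v5.10;
    use Encode qw(encode);

    my $chars  = "\x{20AC}";               # the single character U+20AC
    my $octets = encode('UTF-8', $chars);  # the bytes E2 82 AC

    say length $chars;    # 1
    say length $octets;   # 3

    say $chars  =~ /\A.\z/s ? 'match' : 'no match';  # match: one char
    say $octets =~ /\A.\z/s ? 'match' : 'no match';  # no match: three chars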
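Finally, a sketch of the `from_to` route, assuming the inferred locale
encoding name is in $locale_encoding and a chunk of raw input is in
$line (again, hypothetical names):

    use Encode qw(from_to);

    # Transcode the octets in place, straight from the locale
    # encoding to UTF-8, with no separate decode and later encode.
    from_to($line, $locale_encoding, 'UTF-8');

    print {$out} $line;   # $line now holds UTF-8 octets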