Re: Change of UTF-8 bad byte handling between 5.13.1 and 5.13.8

Paul LeoNerd Evans
July 22, 2011 07:45
On Sat, Jun 25, 2011 at 04:26:03PM -0600, Karl Williamson wrote:
> So, it now defers the check for validity until the input character
> is completely read, and you have short-circuited that with
> STOP_AT_PARTIAL.  Encode could check for input sequences that under
> strict unicode would lead to something larger than 0x10_FFFF, even
> with partial, but I don't think it is obliged to since this is an
> undocumented flag.

It does indeed appear to be doing that. Having read a >= 0x80 byte, it
works out how many more bytes a valid UTF-8 sequence starting with that
byte would require, then does absolutely nothing else until those bytes
are forthcoming.

Nothing. At all. Not even checking that those bytes are UTF-8
continuation bytes. A single high byte followed by US ASCII is and
always will be invalid UTF-8. This means that if the stream contains a
high-valued byte that would start a, say, 4 byte sequence, followed by 2
ASCII characters, the decoder will just sit there awaiting the 4th byte
before rejecting it as invalid, instead of throwing a wobbly now and
reading those two ASCII characters. This could be important if those are
in fact the final CR/LF at the end of a line. This is non-ideal.
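
To illustrate the eager check I have in mind, here is a rough sketch in
Python rather than Perl. It is not how Encode is implemented; the names
expected_length and check_partial are made up for this example, and it
only checks that continuation bytes are in the 0x80..0xBF range (it does
not catch overlong forms or surrogates). The point is just that a partial
sequence can be rejected as soon as a non-continuation byte shows up,
without waiting for the full length to arrive.

```python
def expected_length(lead: int) -> int:
    """Total sequence length implied by a lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte, or a lead byte never valid in UTF-8


def check_partial(buf: bytes) -> str:
    """Classify the first character of buf, which starts at a character boundary.

    Returns "complete", "need-more", or "invalid".
    """
    if not buf:
        return "need-more"
    need = expected_length(buf[0])
    if need == 0:
        return "invalid"
    # Eagerly validate whatever continuation bytes have arrived so far,
    # instead of deferring until all `need` bytes are present.
    for b in buf[1:need]:
        if not 0x80 <= b <= 0xBF:
            return "invalid"
    return "complete" if len(buf) >= need else "need-more"
```

With this, check_partial(b"\xf0\r\n") reports "invalid" straight away,
whereas check_partial(b"\xf0\x9f") still reports "need-more".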

It's a bit annoying, but I guess not -massively- disastrous. In any
case, I've just adjusted my unit test to work around this deferred
checking, and it now seems to be fine.

Thanks for the hint.

Paul "LeoNerd" Evans
ICQ# 4135350       |  Registered Linux# 179460
