Re: what was VMS do here? (was [perl #79960] Setting $/ to read fixedrecords can corrupt valid UTF-8 input)

From: Craig A. Berry
Date: March 3, 2012 15:56
Subject: Re: what was VMS do here? (was [perl #79960] Setting $/ to read fixedrecords can corrupt valid UTF-8 input)
Message ID: A84FB64D-5490-4E2C-BBDA-4CDFE3B507FC@mac.com

On Mar 3, 2012, at 1:29 AM, Eric Brine wrote:

> On Fri, Mar 2, 2012 at 2:03 PM, Eric Brine <ikegami@adaelis.com> wrote:
> On Fri, Mar 2, 2012 at 9:11 AM, Craig A. Berry <craigberry@mac.com> wrote:
> I was thinking of a situation where something external to Perl limits how much data you can get in one read and thus gives you less than the full amount requested by $/.
> 
> That's exactly the situation I described. Here, let me provide the strace output.
> 
> $ strace perl -e'$/=\40; <>;' < /dev/random
> ...
> read(0, "\5|\200\"\360T0*\325\223\276\322\20S\244\16\341", 8192) = 17
> read(0, "\370\356 \2652\236\27>", 8192) = 8
> read(0, "\0\270\ve\332\223\225\312", 8192) = 8
> read(0, "\316\366\272\311\215.\204\361", 8192) = 8
> ...
>
> I'm pretty sure you'll get mangled UTF-8 if you happen to be mid-character when you hit the end of the device buffer.
> 
> No, because Perl will just ask for more. You'll get mangled UTF-8 if you happen to request a number of bytes that ends you mid-character (which is what this ticket is about).
> 
> (If we were talking about sysread instead of readline or read, then yes, it could happen then. Unlike read and readline, sysread returns as soon as bytes are available.)
> 
> And here's an example where one character is read using two reads:
> 
> $ perl -C -e'print "a"x8191, chr(0x2660)' > x
> 
> $ ls -l x
> -rw------- 1 ikegami group 8194 Mar  2 23:26 x
> 
> $ perl -le'use open ":std", ":utf8"; $/=\8194; $_=<>; print $_ eq ("a"x8191).chr(0x2660) ?1:0;' < x
> 1
> 
> strace:
> 
> read(0, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
> read(0, "\231\240", 8192)               = 2
> 

Thanks for clarifying my muddy thinking, Eric. I was neglecting the effects of the buffering layer because it's not used for record mode on VMS, and I had erroneously convinced myself that it's not used elsewhere either, but it is. As long as the perlio buffer is larger than the requested record size, it looks like it will insulate you from anything external to Perl giving you less than the requested size.
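
To make the "Perl will just ask for more" behaviour concrete, here is a minimal sketch of the same idea written by hand on top of sysread. It is only an illustration, not how perlio is implemented; read_full is a hypothetical helper name, and the temporary file merely stands in for a device that may return short reads.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: keep calling sysread until $want bytes have
# arrived or EOF is reached, so the caller never sees a short record
# just because one underlying read() returned early.
sub read_full {
    my ($fh, $want) = @_;
    my $buf = '';
    while (length($buf) < $want) {
        # Append at the current end of $buf, asking only for what is missing.
        my $got = sysread($fh, $buf, $want - length($buf), length($buf));
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;    # EOF: return whatever was collected
    }
    return $buf;
}

# Stand-in input: a temporary file holding 100 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "x" x 100;
close $tmp or die "close: $!";

open my $in, '<:raw', $path or die "open: $!";
my $record = read_full($in, 40);
print length($record), "\n";    # prints 40
close $in;
```

A bare sysread without the loop would hand back whatever a single read() produced, which is the case Eric singles out above.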

So does your second example demonstrate that if you request something larger than the perlio buffer, then you can get caught mid-character on buffer boundaries as well as record boundaries?  And does that first 8192-byte chunk get loaded into an SV that is then invalid if its UTF-8 flag is on?
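
For reference, a small sketch of what that first 8192-byte chunk looks like when taken on its own. It works purely on byte strings built in memory to match Eric's file x. The helper name is_wellformed_utf8 is my own invention, and this says nothing about what perlio or readline actually do with the chunk internally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# utf8::decode converts in place, so it is applied to a private copy.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

# Same content as Eric's file x: 8191 "a"s plus U+2660 as UTF-8 (E2 99 A0).
my $data  = ("a" x 8191) . "\xE2\x99\xA0";    # 8194 bytes
my $first = substr($data, 0, 8192);           # ends on the lead byte E2
my $rest  = substr($data, 8192);              # the two continuation bytes

for my $chunk ($first, $rest, $first . $rest) {
    printf "%d bytes: %s\n", length($chunk),
        is_wellformed_utf8($chunk) ? "well-formed" : "malformed";
}
# prints:
# 8192 bytes: malformed
# 2 bytes: malformed
# 8194 bytes: well-formed
```

So neither piece is valid UTF-8 by itself, and only the rejoined bytes are.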

________________________________________
Craig A. Berry
mailto:craigberry@mac.com

"... getting out of a sonnet is much more
 difficult than getting in."
                 Brad Leithauser

