develooper Front page | perl.perl5.porters | Postings from September 2011

Re: [perl #100058] Perl leaves broken UTF-8 in SVs whose UTF8 is set

Karl Williamson
September 28, 2011 16:36
On 09/28/2011 05:50 AM, Nicholas Clark wrote:
> On Tue, Sep 27, 2011 at 05:09:33PM -0600, Karl Williamson wrote:
>> My understanding is that the original reason for not doing the input
>> checks was performance.  Security is a far more important issue now, and
>> Nicholas has demonstrated code that does the parsing with a minimal
>> performance hit.
> I had hoped to work on it over last Christmas, but everyone got ill and
> my laptop power supply failed. So it didn't happen.
> Whilst I have a feel for how to do it for UTF-8, I have no idea how to do
> it for UTF-8 and UTF-EBCDIC, or at least "not break EBCDIC platforms" or
> "make something hard to port to EBCDIC" as a side effect.

I believe I have the expertise to take what you do for UTF-8 and extend 
it to work on UTF-EBCDIC.  I have in the past ginned up a test platform 
to test some EBCDIC things on Linux; and this looks like a feasible 
candidate for the same treatment.

> I also wasn't sure how to benchmark it properly, to be confident about the
> magnitude of the performance change. I had thought that my test code should
> be *more* efficient than the current code in utf8.c [it did less work], but
> all the numbers I could collect showed it to be slightly slower. Hence why
> I'm not trusting my intuition about what's happening.

I remember seeing the code somewhere, and thinking that it could be 
faster than what we have already.  I believe that the security concerns 
of not doing anything outweigh any performance impacts.  I suspect 
there are performance experts on this list that Jesse could lean on to 
evaluate this extremely important work, which should help keep us from 
getting more CVEs.

> It's also blocking on lack of feedback to bug #79960

So, here are my comments on that bug.  FWIW, here is a link to what 
Unicode says should happen for input validation

I have never used $/ set to a fixed length, but reading the pod, it 
appears to me that the crux of the matter is this, "[it] will attempt to 
read records instead of lines, with the maximum record size being the 
referenced integer."  It also says, "any file you'd want to read in 
record mode is probably unusable in line mode."  That tells me it is ok 
to croak in this situation.

But why not just return only as many complete characters as will fit in 
the fixed length, leaving the pointer at the beginning of the next 
partial character?  The documentation already says that you can't always 
expect a full-length record; and it doesn't say this occurs just at EOF.
It would croak if that partial character is too long to ever fit ($/ 
being very small, as in some of your examples).
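
To make the suggestion above concrete, here is a rough sketch in C of the 
"whole characters only" rule.  It is not code from utf8.c or from 
Nicholas's patch; the names utf8_seq_len and record_prefix_len are made 
up for illustration.  It assumes strict UTF-8 (1 to 4 byte sequences, no 
Perl-extended forms, no UTF-EBCDIC), assumes "avail" counts all the bytes 
remaining in the input, and only checks the shape of lead and 
continuation bytes, so it is not a full well-formedness check.  A return 
of -1 stands in for "croak".

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a UTF-8 lead byte, or 0 if the byte cannot start a
 * sequence.  Assumption: strict UTF-8 only (1..4 bytes). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                 return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Hypothetical helper: how many bytes of buf to return for one
 * fixed-length record of at most reclen BYTES, ending on a character
 * boundary.  Returns -1 where the caller would croak: malformed bytes,
 * or a first character that cannot fit (reclen too small, or the
 * character is cut off by the end of the input). */
static long record_prefix_len(const unsigned char *buf,
                              size_t avail, size_t reclen)
{
    size_t limit = avail < reclen ? avail : reclen;
    size_t pos = 0;

    assert(buf != NULL || avail == 0);

    while (pos < limit) {
        size_t len = utf8_seq_len(buf[pos]);
        size_t i;

        if (len == 0)
            return -1;          /* not a valid lead byte */
        if (len > limit - pos) {
            if (pos == 0)
                return -1;      /* not even one character fits */
            break;              /* leave the partial character for next read */
        }
        for (i = 1; i < len; i++)
            if ((buf[pos + i] & 0xC0) != 0x80)
                return -1;      /* bad continuation byte */
        pos += len;
    }
    return (long)pos;
}
```

With the four bytes 'a', 0xC3, 0xA9, 'b' and a record size of 2, the 
first call returns 1 (just the 'a'), and the next call, starting at the 
0xC3, returns 2.  With a record size of 1 the second call has nowhere to 
go and reports -1, which is the "$/ being very small" case.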

I do think that the buffer length should only be construed as bytes and 
not characters.
