On 09/28/2011 05:50 AM, Nicholas Clark wrote:
> On Tue, Sep 27, 2011 at 05:09:33PM -0600, Karl Williamson wrote:
>
>> My understanding is that the original reason for not doing the input
>> checks was performance.  Security is a far more important issue now, and
>> Nicholas has demonstrated code that does the parsing with a minimal
>> performance hit.
>
> I had hoped to work on it over last Christmas, but everyone got ill and
> my laptop power supply failed.  So it didn't happen.
>
> Whilst I have a feel for how to do it for UTF-8, I have no idea how to do
> it for UTF-8 and UTF-EBCDIC, or at least without "break EBCDIC platforms"
> or "make something hard to port to EBCDIC" as a side effect.

I believe I have the expertise to take what you do for UTF-8 and extend it
to work on UTF-EBCDIC.  I have in the past ginned up a test platform to
test some EBCDIC things on Linux, and this looks like a feasible candidate
for the same treatment.

> I also wasn't sure how to benchmark it properly, to be confident about the
> magnitude of the performance change.  I had thought that my test code
> should be *more* efficient than the current code in utf8.c [it did less
> work], but all the numbers I could collect showed it to be slightly
> slower.  Hence I'm not trusting my intuition about what's happening.

I remember seeing the code somewhere and thinking that it could be faster
than what we have already.  I believe that the security concerns of not
doing anything outweigh any performance impacts.  I suspect there are
performance experts on this list whom Jesse could lean on to evaluate this
extremely important work, which should help keep us from getting more CVEs.

> It's also blocking on lack of feedback to bug #79960

So, here are my comments on that bug.

FWIW, here is a link to what Unicode says should happen for input
validation:
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf#page=42

I have never used $/ set to a fixed length, but reading the pod, it appears
to me that the crux of the matter is this: "[it] will attempt to read
records instead of lines, with the maximum record size being the referenced
integer."  It also says, "any file you'd want to read in record mode is
probably unusable in line mode."  That tells me it is OK to croak in this
situation.

But why not just return only as many complete characters as will fit in the
fixed length, leaving the pointer at the beginning of the next partial
character?  The documentation already says that you can't always expect a
full-length record, and it doesn't say this happens only at EOF.  It would
croak if that partial character is too long to ever fit ($/ being very
small, as in some of your examples).

I do think that the buffer length should only be construed as bytes, not
characters.
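
To make the validation point concrete, here is a small, self-contained C
sketch of the kind of checks the cited Unicode chapter (its table of
well-formed UTF-8 byte sequences) calls for.  This is only an illustration
written for this mail, not the code in utf8.c and not Nicholas's
experimental code; the function name is made up.

/* Minimal sketch of strict UTF-8 validation along the lines of the cited
 * Unicode chapter (Table 3-7, "Well-Formed UTF-8 Byte Sequences").  It
 * rejects overlong forms, surrogates (U+D800..U+DFFF), and anything above
 * U+10FFFF. */
#include <stddef.h>
#include <stdint.h>

/* Returns the length in bytes of the well-formed character starting at s,
 * or 0 if the sequence is ill-formed or runs past 'end'. */
size_t
utf8_char_len(const uint8_t *s, const uint8_t *end)
{
    if (s >= end)
        return 0;
    if (s[0] <= 0x7F)                               /* ASCII */
        return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {             /* 2-byte, no overlongs */
        if (end - s >= 2 && (s[1] & 0xC0) == 0x80)
            return 2;
        return 0;
    }
    if (s[0] >= 0xE0 && s[0] <= 0xEF) {             /* 3-byte */
        uint8_t lo = (s[0] == 0xE0) ? 0xA0 : 0x80;  /* exclude overlongs  */
        uint8_t hi = (s[0] == 0xED) ? 0x9F : 0xBF;  /* exclude surrogates */
        if (end - s >= 3 && s[1] >= lo && s[1] <= hi
            && (s[2] & 0xC0) == 0x80)
            return 3;
        return 0;
    }
    if (s[0] >= 0xF0 && s[0] <= 0xF4) {             /* 4-byte, <= U+10FFFF */
        uint8_t lo = (s[0] == 0xF0) ? 0x90 : 0x80;
        uint8_t hi = (s[0] == 0xF4) ? 0x8F : 0xBF;
        if (end - s >= 4 && s[1] >= lo && s[1] <= hi
            && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80)
            return 4;
        return 0;
    }
    return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a character */
}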
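And to make the suggestion about partial characters concrete, here is a
hypothetical helper showing how a fixed-length ($/ = \N) read could hand
back only complete characters and leave the trailing partial character for
the next read.  The name record_complete_len is invented for this sketch,
not Perl API; it builds on utf8_char_len() above and treats the record
length purely as a byte count, as argued above.

/* Given 'got' bytes just read into buf, return how many of them form
 * complete UTF-8 characters, and report in *partial_bytes how many
 * trailing bytes belong to a character that did not fit, so the read
 * position can be left at the start of that partial character. */
size_t
record_complete_len(const uint8_t *buf, size_t got, size_t *partial_bytes)
{
    size_t complete = 0;

    *partial_bytes = 0;
    while (complete < got) {
        size_t len = utf8_char_len(buf + complete, buf + got);
        if (len == 0) {
            /* Either a character truncated by the record boundary, or a
             * genuinely ill-formed sequence; either way the remainder is
             * left for the next read (or for the caller to croak on,
             * e.g. when $/ is too small to ever hold it). */
            *partial_bytes = got - complete;
            break;
        }
        complete += len;
    }
    return complete;    /* bytes that may safely be returned as the record */
}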