Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400

From: karl williamson
Date: November 13, 2008 12:24
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 491C8CCF.9020602@khwilliamson.com

demerphq wrote:
> 2008/11/13 karl williamson <public@khwilliamson.com>:
>> demerphq wrote:
> [snip]
>>> Assuming that grok_oct() consumes at most 3 octal digits, I think we
>>> can apply Karl's patch. However, I do think we should recommend against
>>> using octal IN REGULAR EXPRESSIONS. And should note that while you CAN
>>> use octal to represent codepoints up to 511 it is strongly recommended
>>> that you don't.
>>>
>>> Also I have a concern that Karl's patch merely modifies the behaviour
>>> in the regular expression engine. It doesn't do the same for other
>>> strings. If it is going to be legal it should be legal everywhere.
>>>
>> grok_oct() itself consumes as many octal digits as there are in its
>> parameter, as long as the result doesn't overflow a UV.  It is used for
>> general purpose octal conversion, such as from the oct() function.
> 
> Somewhere, though, we have to have a limit on the number of digits, don't we?
> 
> (I'm very tired right now and haven't looked)
> 

You pass it a maximum length; regcomp.c passes it 3.  The oct()
function passes it the actual length of its argument.
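
To make the difference between the two callers concrete, here is a
minimal sketch of a length-limited octal parser.  It is NOT the real
grok_oct(): the name parse_octal, its signature, and the "consumed"
out-parameter are all made up for illustration, and it leaves out the
UV overflow check mentioned above.

```c
#include <stddef.h>

/* Hypothetical helper, not Perl's grok_oct(): parse at most max_len
 * leading octal digits of s.  The number of digits actually used is
 * stored in *consumed when consumed is non-NULL.
 * Assumption: no overflow guard, unlike the real function. */
unsigned long parse_octal(const char *s, size_t max_len, size_t *consumed)
{
    unsigned long value = 0;
    size_t i = 0;

    /* Stop at the length limit or at the first non-octal character. */
    while (i < max_len && s[i] >= '0' && s[i] <= '7') {
        value = (value << 3) | (unsigned long)(s[i] - '0');
        i++;
    }
    if (consumed)
        *consumed = i;
    return value;
}
```

With a limit of 3 (as a regex-compiler-style caller would pass), the
input "4001" yields 0400 and leaves the trailing "1" unread; with a
limit of 4 (an oct()-style caller passing the full length) the same
input yields 04001.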


>> My patch was to bring consistency to the handling of \400-\777.  Outside
>> re's, putting them into a string variable will cause the string to be
>> converted to utf8, and so they will be converted into two utf8 bytes as part
>> of that string.  Similarly, using any of these octal values in an re
>> charclass will cause the re to be converted to utf8, and will match the
>> corresponding unicode code point.  But when values in this range appear in
>> an re outside a charclass there is an inconsistency.  On an 8-bit character
>> machine (if there aren't 256 or so parenthetical sub expressions in the re)
>> they will match a two character sequence, but not the same utf8 sequence
>> matched if they had instead appeared in a charclass.  I'm not sure what
>> would happen on a 9-bit machine.  It might very well be what Glenn suggests,
>> the corresponding 9 bits.
>>
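
As an illustration of the "two utf8 bytes" point, the sketch below
encodes a code point in the range \400-\777 (256-511) using the
standard two-byte UTF-8 layout.  The function name utf8_encode_small
is made up for this example, it only handles code points below 0x800,
and it is not taken from the Perl source.

```c
#include <stddef.h>

/* Illustrative encoder for code points below 0x800 only (an assumption
 * of this sketch).  Returns the number of bytes written to out, or 0
 * if the code point is outside the range handled here. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;                        /* not handled by this sketch */
}
```

Under this layout \400 becomes the bytes 0xC4 0x80 and \777 becomes
0xC7 0xBF, so every value in the range occupies exactly two bytes.
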
>> Tom has pointed out that \777 is a reserved value in some contexts.
> 
> Oh? I missed that.
> 
>> It seems to me to be a bad idea to remove acceptance of octal numbers in
>> re's.
> 
> Yes, I think that's been well demonstrated.
> 
>> It seems like a good idea to add something to the language so one can
>> express them unambiguously.  Even I with my limited knowledge of regcomp.c
>> could do it easily (fools rush in...).
> 
> Sometimes being able to do something is simply not having the fear
> that it might be too hard. :-)
> 
>> And it seems like an even better idea to handle them consistently.  I see
>> two ways to do that 1) accept my patch; or
> 
> That's pretty much a given. I just haven't had the time yet.
> 
> And well while it was being contentiously debated I wanted to wait and
> see a bit. :-)
> 
>> 2) forbid or warn about the use
>> of those larger than a single character in the machine architecture in both
>> strings and re's, including char classes.
> 
> I need to think about this one.
> 
>> Perhaps I've forgotten something in this thread.  If so, I'm sorry.
> 
> Please don't be sorry. For me you are a welcome breath of fresh air. It's
> wonderful to have you on board.
> 
> Yves
> 
Thank you