2008/11/13 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
[snip]
>> Assuming that grok_oct() consumes at most 3 octal digits, I think we
>> can apply Karl's patch. However I do think we should recommend against
>> using octal IN REGULAR EXPRESSIONS. And should note that while you CAN
>> use octal to represent codepoints up to 511 it is strongly recommended
>> that you don't.
>>
>> Also I have a concern that Karl's patch merely modifies the behaviour
>> in the regular expression engine. It doesn't do the same for other
>> strings. If it is going to be legal it should be legal everywhere.
>>
> grok_oct() itself consumes as many octal digits as there are in its
> parameter, as long as the result doesn't overflow a UV. It is used for
> general purpose octal conversion, such as from the oct() function.

Somewhere tho we have to have a limit on the number of digits, don't we?
(I'm very tired right now and haven't looked)

> My patch was to bring consistency to the handling of \400-\777. Outside
> re's, putting them into a string variable will cause the string to be
> converted to utf8, and so they will be converted into two utf8 bytes as part
> of that string. Similarly, using any of these octal values in an re
> charclass will cause the re to be converted to utf8, and will match the
> corresponding unicode code point. But when values in this range appear in
> an re outside a charclass there is an inconsistency. On an 8-bit character
> machine (if there aren't 256 or so parenthetical sub expressions in the re)
> they will match a two character sequence, but not the same utf8 sequence
> matched if they had instead appeared in a charclass. I'm not sure what
> would happen on a 9-bit machine. It might very well be what Glenn suggests,
> the corresponding 9 bits.
>
> Tom has pointed out that \777 is a reserved value in some contexts.

Oh? I missed that.

> It seems to me to be a bad idea to remove acceptance of octal numbers in
> re's.

Yes I think that's been well demonstrated.

> It seems like a good idea to add something to the language so one can
> express them unambiguously. Even I with my limited knowledge of regcomp.c
> could do it easily (fools rush in...).

Sometimes being able to do something is simply not having the fear
that it might be too hard. :-)

> And it seems like an even better idea to handle them consistently. I see
> two ways to do that 1) accept my patch; or

That's pretty much a given. I just haven't had the time yet. And well
while it was being contentiously debated I wanted to wait and see a
bit. :-)

> 2) forbid or warn about the use
> of those larger than a single character in the machine architecture in both
> strings and re's, including char classes.

I need to think about this one.

> Perhaps I've forgotten something in this thread. If so, I'm sorry.

Please don't be sorry. For me you are a welcome breath of fresh air. It's
wonderful to have you on board.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
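
For readers following the digit-limit question above, here is a minimal sketch in C of the two behaviours under discussion: an octal parser capped at a fixed number of digits versus one that consumes every digit until the value would overflow. It also shows why \400-\777 (code points 256-511) end up as two UTF-8 bytes. Both parse_octal() and utf8_encode_small() are hypothetical helpers written for this illustration; they are not Perl's grok_oct() or any other function from the perl core, and a 64-bit integer stands in for a UV as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT Perl's grok_oct(): parse at most max_digits
 * octal digits from s (pass SIZE_MAX for "as many as are present").
 * Parsing stops at the first non-octal character, at the digit cap, or
 * when one more digit would overflow 64 bits. The number of digits
 * actually consumed is stored in *used. */
uint64_t parse_octal(const char *s, size_t max_digits, size_t *used)
{
    uint64_t value = 0;
    size_t n = 0;

    while (n < max_digits && s[n] >= '0' && s[n] <= '7') {
        if (value > (UINT64_MAX >> 3))
            break;                      /* next shift would overflow */
        value = (value << 3) | (uint64_t)(s[n] - '0');
        n++;
    }
    *used = n;
    return value;
}

/* Hypothetical helper: encode a code point below 0x800 as UTF-8.
 * Returns the number of bytes written (1 or 2), or 0 if the code
 * point is outside the range this sketch handles. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation */
        return 2;
    }
    return 0;
}
```

With a cap of 3, the input "7777" yields 511 and leaves the fourth digit unread; with no cap, the same input yields 4095. Encoding 0400 (256) gives the bytes C4 80 and 0777 (511) gives C7 BF, so every value in the \400-\777 range needs two bytes.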