develooper Front page | perl.perl5.porters | Postings from February 2012

Re: [perl #77654] quotemeta() fails to quote literal non-word character under utf8

Karl Williamson
February 5, 2012 18:28
Re: [perl #77654] quotemeta() fails to quote literal non-word character under utf8
On 12/29/2010 03:57 AM, Dave Mitchell wrote:
> On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:
>>> I've always wondered why a lone } or ] does not need escaping (they're
>>> only special after an opening { or [ has been seen), but a lone ) does.
>> So have I.  It could be worse: things like quantifiers still
>> need escaping to be made literals even if they couldn't quantify
>> something, such as at the beginning of a string.  A (poor) argument
>> could be made that in such a position, escaping isn't necessary
>> to infer function, and it seems to me some nasty regex dialects
>> do just that.  I certainly don't care for it.
>>> And I don't think Perl5 ever will. There's so much code out there that
>>> doesn't escape \W characters outside of the dozen mentioned above (and
>>> if we see a newbie escaping a \W outside of the dozen, we pick on him),
>> Now that you mention it, you're right, we do.  Hadn't thought of that.
> Ok. How about the following resolution: we change it so that utf8 strings
> get chr(128)-chr(255) escaped, so that it matches the non-utf8 case, and
> leave chars > 255 unescaped. In some future world if chars > 255 start
> having special meaning to the regex engine, then we start escaping them
> too.

This proposal and all others died in 5.14 for lack of consensus.  This 
leaves the Unicode bug extant for quotemeta, and I would like to get it 
fixed.  Tom has told me privately that he's ok with changing things to 
get consistent rules for UTF8- vs non-UTF8 encoded strings.

I'm thinking we should just do what the original trouble ticket asks 
for, and what the documentation has always said, and that is to quote 
everything that matches [^a-zA-Z0-9_].  This agrees with the first part
of Dave's proposal, but also escapes all characters above Latin-1.
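
To make the difference between the two proposals concrete, here is a rough
model of each rule. It is written in Python purely as an illustration and is
not Perl's implementation; the function names are made up for this sketch,
and it works on code points, ignoring how the string is encoded internally.

```python
import re

# Everything outside the ASCII word set named in the quotemeta documentation.
_NON_WORD = re.compile(r"[^a-zA-Z0-9_]")


def quote_all_non_word(s: str) -> str:
    """Model of the rule proposed here: backslash every character
    matching [^a-zA-Z0-9_], whatever its code point."""
    return _NON_WORD.sub(lambda m: "\\" + m.group(0), s)


def quote_up_to_latin1(s: str) -> str:
    """Model of Dave's proposal: the same rule for code points 0..255,
    but characters above 255 are left unescaped."""
    def escape(m: re.Match) -> str:
        ch = m.group(0)
        return ch if ord(ch) > 255 else "\\" + ch
    return _NON_WORD.sub(escape, s)


if __name__ == "__main__":
    # '.' is ASCII punctuation, U+00E9 is in the Latin-1 range,
    # U+03BB is above it.
    for sample in ["a.b", "\u00e9", "\u03bb"]:
        print(repr(sample),
              repr(quote_all_non_word(sample)),
              repr(quote_up_to_latin1(sample)))
```

The two rules agree on every code point up to 255 and differ only above
that: the first escapes U+03BB, the second leaves it alone.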

I'm reopening this publicly now, in order to try to get resolution in 
the next week or so, so that we can do something for 5.16.  Either
proposal is easy to implement and cheap in CPU cycles.

If we do this, does that close the door on later changing to use the 
pattern syntax should it ever become necessary?  I think that it 
doesn't.  This thread included extensive discussion on that.
