perl.perl5.porters | Postings from February 2015

RFC: /w pattern modifier

Karl Williamson
February 8, 2015 05:52
As discussed many months ago, I am implementing \b{...} to allow more 
boundary types than plain \b.

The three types that will be in 5.22 are:
\b{gcb}  grapheme cluster break.  \X is defined as .+?\b{gcb}

\b{sb}   sentence break.  Is true if Unicode thinks this is a boundary 
between two sentences.  It does a decent job of this, but it thinks that 
"Mr. Jones" is two sentences.

\b{wb}   word break.  Is true if Unicode thinks this is a boundary 
between two words.
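
A minimal sketch of the three boundary types, assuming a Perl of 5.22 or 
later.  The sample strings and variable names are made up for 
illustration and are not from the original message.

```perl
use strict;
use warnings;
use v5.22;    # \b{gcb}, \b{sb} and \b{wb} need Perl 5.22 or later

# Grapheme clusters: "e" followed by COMBINING ACUTE ACCENT is one cluster,
# so \X and .+?\b{gcb} should both see four clusters here.
my $cafe   = "cafe\x{301}";
my @by_X   = $cafe =~ /(\X)/g;
my @by_gcb = $cafe =~ /(.+?\b{gcb})/g;
printf "clusters: %d via \\X, %d via \\b{gcb}\n",
    scalar @by_X, scalar @by_gcb;

# Sentence breaks: the period after "Mr" is taken as the end of a sentence.
my $text      = "Mr. Jones arrived. He sat down.";
my @sentences = $text =~ /(.+?)\b{sb}/g;
print "sentence: [$_]\n" for @sentences;

# Word breaks: every chunk between boundaries, spaces and punctuation included.
my @chunks = "Don't stop." =~ /(.+?)\b{wb}/g;
print "chunk: [$_]\n" for @chunks;
```

The sentence example splits "Mr. " off as its own sentence, which is the 
limitation mentioned above.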

Unicode also defines a "Line Break", which could be implemented as 
\b{lb}, but I'm not sure of the usefulness of this given that there is a 
Unicode::LineBreak CPAN module already available.

Plain \b is true at boundaries between \w and \W characters.  I'm 
told that Perl newbies tend to think of \b as being more like a \s/\S 
boundary.  I considered implementing this, as it's almost trivial to 
do (\b{space} could mean that), but in thinking about it, it appears 
to me that what they really want is \b{wb}, which gives better results 
for natural languages.  For example, it should make "don't" a word in 
the phrase "... don't.)", including the apostrophe but excluding the 
period and the parenthesis.  I see no need to implement an inferior 
version just because it's easy to do.
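
The difference between the three notions of "word" can be sketched as 
follows, assuming Perl 5.22 or later.  The sample phrase is invented, 
and the whitespace variant uses split as a stand-in for the \b{space} 
idea, which is only floated here and not an existing construct.

```perl
use strict;
use warnings;
use v5.22;

my $text = q{(I really don't.)};

# Plain \b: boundaries between \w and \W, so the apostrophe splits "don't".
my @plain = $text =~ /\b(\w.*?)\b/g;

# Whitespace boundaries (roughly the \s/\S idea, emulated with split):
# the apostrophe survives, but so does the surrounding punctuation.
my @spaced = split ' ', $text;

# \b{wb}: keeps the apostrophe, drops the period and the parentheses.
my @wb = $text =~ /\b{wb}(\w.*?)\b{wb}/g;

print "plain : @plain\n";
print "spaced: @spaced\n";
print "wb    : @wb\n";
```

Only the \b{wb} version yields "don't" as a single word with no 
punctuation attached.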

It has now occurred to me that a lot of existing \b uses really would 
work better if they were \b{wb}.  And that can be accomplished without 
having to change every occurrence, by instead having a pattern modifier 
flag, /w, which could also be set with 'use re "/w"', and which says to 
treat plain \b as \b{wb} in its scope.
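
Since /w is only a proposal here, its effect can at best be approximated. 
The sketch below assumes Perl 5.22 or later and uses a hypothetical 
helper, with_wb, that naively rewrites a bare \b in a pattern string to 
\b{wb}.  It is not part of Perl and ignores corner cases such as [\b] 
(backspace in a character class) and escaped backslashes.

```perl
use strict;
use warnings;
use v5.22;

# Hypothetical helper, not part of Perl: turn each bare \b into \b{wb}.
# Naive on purpose; it does not understand [\b] or escaped backslashes.
sub with_wb {
    my ($pattern) = @_;
    $pattern =~ s/\\b(?!\{)/\\b{wb}/g;
    return qr/$pattern/;
}

my $plain = qr/\bdon\b/;
my $wb    = with_wb('\bdon\b');

for my $s ("I don't know", "the don spoke") {
    printf "%-15s plain=%s wb=%s\n", $s,
        ($s =~ $plain ? "match" : "no match"),
        ($s =~ $wb    ? "match" : "no match");
}
```

With plain \b the pattern finds "don" inside "don't"; with \b{wb} it 
does not, because no word break falls before the apostrophe.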

I don't see any real use for pretending that \b is any of the other 
break types, so I think this is the only modifier affecting \b that 
would ever make sense.

I'm not sure how I feel about this, but I thought I should throw it out 
there to garner feedback.
