
RFC: \b{wb}

From: Karl Williamson
Date: July 13, 2018 16:10
Subject: RFC: \b{wb}
Message ID: c8478672-7431-495b-8870-6d1477ed366a@khwilliamson.com

I added this type of boundary in 5.22, based on the Unicode concept of 
what constitutes a word boundary.  But Unicode considered every single 
white space character to be a separate word, so that if you split, say, 
on word boundaries and you had two space characters in a row between 
words, you'd get a partition like

  |abc| | |def|

instead of the more intuitive

  |abc|  |def|

(where the '|' characters shown are otherwise invisible markers for the 
edges of words).
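
A minimal sketch for reproducing the partition above on your own perl. It assumes perl 5.22 or later (where \b{wb} exists); the helper name wb_pieces is hypothetical and only for illustration. Which of the two partitions gets printed depends on whether the perl in use has the tailoring.

```perl
use v5.22;      # \b{wb} is not available before 5.22; also enables strict and say
use warnings;

# Hypothetical helper: cut a string at every \b{wb} word boundary.
sub wb_pieces {
    my ($text) = @_;
    return split /\b{wb}/, $text;
}

# Two ordinary spaces in a row between two words.
my @pieces = wb_pieces("abc  def");

# Show the pieces with '|' marking each edge, as in the diagrams above.
say '|' . join('|', @pieces) . '|';
```

Three pieces means the run of spaces stayed together; four means each space was treated as its own word.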

So I tailored their algorithm to do what appeared more sane to me.  I 
haven't checked, but I imagine we had a discussion on p5p at the time.

I also filed a ticket with Unicode, asserting that their scheme led to 
non-ideal results.  And they actually changed their scheme in the latest 
release, 11.0, to not split between spaces.

However, their new scheme differs from perl's.  There are three 
horizontal space characters that they still consider to be separate 
words: TAB, NO-BREAK SPACE, and FIGURE SPACE (a space as wide as one of 
the font's digit characters).  This means that if you have two tabs in 
a row (to indent, for example), or two NBSPs in a row (to widen the gap 
between two words, for example, while still keeping them on the same 
line), each one will be a separate word.  This makes little sense to 
me, and so I'm tempted to retain the original tailoring that perl has 
had for several releases now.  But I thought I should ask for comments 
from the community.
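
For anyone who wants to check how a given perl treats these three characters, here is a rough probe. It is a sketch under the same assumption of perl 5.22 or later; wb_count and the sample strings are made up for illustration, and no particular result is claimed here.

```perl
use v5.22;      # \b{wb} is not available before 5.22; also enables strict and say
use warnings;

# Hypothetical helper: how many pieces does \b{wb} cut the string into?
sub wb_count {
    my ($text) = @_;
    my @pieces = split /\b{wb}/, $text;
    return scalar @pieces;
}

# Each sample is a letter, two identical space characters, then a letter.
my %samples = (
    'two TABs'            => "a\t\tb",
    'two NO-BREAK SPACEs' => "a\N{U+00A0}\N{U+00A0}b",
    'two FIGURE SPACEs'   => "a\N{U+2007}\N{U+2007}b",
);

for my $name (sort keys %samples) {
    # 3 pieces: the pair stayed together; 4 pieces: each character is its own word.
    say "$name: ", wb_count($samples{$name}), " pieces";
}
```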
