From: Karl Williamson
July 13, 2018 16:10
Message ID: firstname.lastname@example.org
I added this type of boundary in 5.22 based on the Unicode concept of
what constitutes a word boundary. But they considered every single
white space character to be a separate word, so that, if you split, say,
on word boundaries and you had two space characters in a row between
words, you'd get a partition like
|abc| | |def|
instead of the more intuitive
|abc|  |def|
(where the '|' characters shown are otherwise invisible markers for the
edge of a word).
So I tailored their algorithm to do what appeared more sane to me. I
haven't checked, but I imagine we had a discussion on p5p at the time.
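For anyone who wants to see the partition on their own perl, here is a
minimal sketch. It assumes a perl new enough to have \b{wb} (5.22 or
later); the variable names are only illustrative, and what it prints
depends on which tailoring that perl applies.

```perl
use 5.022;      # \b{wb} does not exist before 5.22
use strict;
use warnings;

# Two ordinary spaces between two words, as in the example above.
my $text = "abc  def";

# Split at every position where the \b{wb} assertion matches.
my @pieces = split /\b{wb}/, $text;

# Show the pieces with '|' marking each boundary.
print '|', join('|', @pieces), "|\n";
```

If the two spaces come out as one piece, that perl does not break
inside a run of spaces; if each comes out as its own piece, it follows
the older Unicode rule.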
I also filed a ticket with Unicode, asserting that their scheme led to
non-ideal results. And they actually changed their scheme in the latest
release, 11.0, to not split between spaces.
However, their new scheme differs from perl's. There are three
horizontal space characters that they still consider to be separate
words. These are TAB, NO-BREAK SPACE, and FIGURE SPACE (a space as wide
as one of the font's digit characters). This means that if you have two
tabs in a row (to indent, for example), or two NBSPs in a row (to
increase the width between two words, for example, while still keeping
them on the same line), each one will be a separate word.
This makes little sense to me, and so I'm tempted to retain the original
tailoring that perl has had for several releases now. But I thought I
should ask for comments from the community.
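To compare behaviours concretely, a sketch along these lines reports
how many pieces \b{wb} yields when two of each of those characters sit
between two words. The sample strings and the %count hash are made up
for illustration, and no particular result is assumed: the counts show
whichever rule the perl running it implements.

```perl
use 5.022;      # \b{wb} does not exist before 5.22
use strict;
use warnings;

# Hypothetical samples: two identical space characters between two words.
my %samples = (
    'TAB'            => "abc\t\tdef",
    'NO-BREAK SPACE' => "abc\x{A0}\x{A0}def",
    'FIGURE SPACE'   => "abc\x{2007}\x{2007}def",
);

# Number of pieces produced by splitting each sample on \b{wb}.
my %count;
for my $name (sort keys %samples) {
    my @pieces = split /\b{wb}/, $samples{$name};
    $count{$name} = scalar @pieces;
    printf "%-15s %d pieces\n", $name, $count{$name};
}
```

A count of 3 would mean the pair of space characters stayed together
as one piece; a count of 4 would mean each was treated as a separate
word.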