
Re: Zero-width split() match creates empty trailing strings but not empty leading strings

From: Aristotle Pagaltzis
Date: March 10, 2013 17:32
Subject: Re: Zero-width split() match creates empty trailing strings but not empty leading strings
Message ID: 20130310173209.GA26219@fernweh.plasmasturm.org

Hi Chris,

* Chris Povirk <cpovirk@google.com> [2013-01-19 00:05]:
> The perlfunc documentation spells this out clearly, and it matches
> what I see:
>
> $ perl -e 'for (split(//, "fob", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
> f
> o
> b
> <blank>
>
> The question on my mind is why.

it’s not an accident of implementation exactly, but you might call it an
accident of semantics. It’s due to `split` being defined in terms of
pattern matching and due to how pattern matching operates. Consider the
output you get from the following:

    perl -Mre=debug -E 'sub x{say "-"x72} $_ = "fob"; x; x while /(?:)/g; x'

(This pattern yields a regexp program identical to that in `split //`.)

Note that the regexp engine starts matching at position 0, then bumps
along the string each time it detects a zero-width match at the same
position where it previously matched (these are all the “Match possible,
but” lines).

Note well that it succeeds matching the empty pattern at position 3,
i.e. just beyond the end of the string. That last one is where the
trailing empty field comes from.
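
If you would rather not wade through the debug output, here is a minimal sketch of the same observation. It just records where each match of the empty pattern lands; the variable names are mine.

```perl
use strict;
use warnings;

my $str = "fob";
my @positions;

# (?:) rather than // because a bare empty pattern in a match
# means "reuse the last successful pattern".
while ($str =~ /(?:)/g) {
    push @positions, pos($str);   # where this zero-width match sits
}

print "@positions\n";   # prints: 0 1 2 3
```

Four matches on a three-character string: one before each character, plus the one at position 3, past the last character.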

The curious thing here is that it also succeeds matching the empty
pattern at the *start* of the string – yet the leading empty field is
never present in the output from `split`! Evidently, `split` actively
suppresses this initial empty field.
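
To make the asymmetry visible, here is a small sketch that brackets every field so the empty ones show up. The `show` helper is only for illustration; the comma case is there for contrast, since a positive-width match at the start does produce a leading empty field.

```perl
use strict;
use warnings;

# Wrap each field in brackets so empty fields are visible.
sub show { return join ",", map("[$_]", @_) }

my $no_limit  = show(split(//,  "fob"));         # [f],[o],[b]
my $neg_limit = show(split(//,  "fob", -1));     # [f],[o],[b],[]
my $comma     = show(split(/,/, ",a,b,,", -1));  # [],[a],[b],[],[]

print "$_\n" for $no_limit, $neg_limit, $comma;
```

The zero-width match at position 0 never yields a field, with or without the limit, whereas the one at the end does as soon as the limit is negative.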

> Thanks for any pointers. I tried to find the source for split, and I
> think I may have found it in pp_split in pp.c. But there's not really
> any reason to expect the source code to include the justification, and
> I couldn't find one.
>
> http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/Splitter.html

Presumably `split` suppresses the leading empty field because that one
cannot be suppressed selectively by the user as trailing empty fields
can, and it would therefore always present an obstacle to step around
in user code.

My best guess is that the user’s ability to expose the trailing empty
match using a limit of -1 was deemed harmless, but conversely, giving
the user a corresponding ability to expose the leading empty match was
deemed not worthwhile.

If this reasoning is correct, then the demonstrated behaviour with the
explicit limit and the trailing empty field has no particular semantic
worth, negative or positive, and is essentially arbitrary. The aim was
simply to make `split` DWIM in the simple case.

> In particular, is it a decision worth replicating to language's
> libraries?

In light of the above I’d say the answer is: do as you will.

If your splitter function is defined in terms of pattern matching, and
you follow the example of Perl’s `split` regarding an absent vs negative
limit, and your regexp engine operates in a way that would lead to these
empty matches, etc. – then you might follow Perl’s example and simply
suppress only the leading empty field while leaving the absent limit to
suppress the trailing fields.

If this is not how your splitter function works – then I don’t see the
reason to go out of your way in order to emulate the behaviour of Perl’s
`split` either.
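
For what it is worth, here is a rough sketch of what a splitter defined in terms of pattern matching might look like. It is not how `pp_split` is written; `naive_split` and its third argument are made up for illustration, and it ignores positive limits, capture groups and the empty-input case.

```perl
use strict;
use warnings;

# Hypothetical splitter: walk the matches, cut fields between them.
# $keep_trailing plays the role of a negative limit.
sub naive_split {
    my ($re, $str, $keep_trailing) = @_;
    my @fields;
    my $start = 0;

    while ($str =~ /$re/g) {
        my ($mstart, $mend) = ($-[0], $+[0]);

        # A zero-width match at position 0 would yield a leading
        # empty field; skip it, as Perl's split does.
        next if $mend == 0;

        push @fields, substr($str, $start, $mstart - $start);
        $start = $mend;
    }
    push @fields, substr($str, $start);   # whatever follows the last match

    # An absent limit strips trailing empty fields.
    unless ($keep_trailing) {
        pop @fields while @fields && $fields[-1] eq '';
    }
    return @fields;
}

print join("|", naive_split(qr/(?:)/, "fob", 1)), "\n";   # f|o|b|
print join("|", naive_split(qr/(?:)/, "fob", 0)), "\n";   # f|o|b
```

The trailing empty field falls out of the matching loop for free; only the leading one needs a special case, which is the whole of the asymmetry.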

HTH,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
