Strange behavior of (python) str.split when using the default sep value (None)

Time:02-12

Does anyone have a clear way to explain the rule behind str.split(sep=None)? The docstring gives some explanation, but not enough to account for the following behavior:

>>> s = '\n Hello\t World\nOpps\t '
>>> print(s.split()) #by default sep = None
['Hello', 'World', 'Opps']
>>> print(s.split(maxsplit = 1))
['Hello', 'World\nOpps\t ']
  • The first print seems reasonable: it splits wherever whitespace appears and then drops all the empty strings.
  • The second one is a bit harder to understand: if it splits only once, it should treat the first '\n' as the delimiter, producing an empty string '' (which would then be dropped) and ' Hello\t World\nOpps\t '.

Thank you in advance for any consistent and logical explanation you may provide.

PS: I have included the docstring below. I am aware of this question, and an answer there offers some help, but a clear rule backed by an official source is still missing. Without an official explanation, a rule is just something for users to memorize without understanding.

sep
The delimiter according which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.

CodePudding user response:

You should look at the official docs, where this is explained in detail:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

Bold emphasis added by me.
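As a small illustration of the quoted paragraph, the sketch below contrasts the default separator with an explicit one. The sample strings and variable names are made up for this example; only str.split itself comes from Python.

```python
text = " a  b "

# sep=None: each run of whitespace is one separator, and no empty
# strings are produced for the leading or trailing whitespace.
default_result = text.split()        # ['a', 'b']

# sep=" ": every single space is a separator, so empty strings appear.
explicit_result = text.split(" ")    # ['', 'a', '', 'b', '']

# Empty or whitespace-only input with sep=None gives an empty list...
empty_result = "".split()            # []
blank_result = " \t\n".split()       # []

# ...whereas an explicit separator still yields one (empty) field.
empty_explicit = "".split(" ")       # ['']

print(default_result, explicit_result, empty_result, blank_result, empty_explicit)
```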

However, this does leave some ambiguity about what counts as a "split" in the case of sep=None, maxsplit=some_positive_number. Apparently, the leading and trailing whitespace act, at least conceptually, as if they simply weren't there. But notice:

>>> "   a   b c ".split(maxsplit=1)
['a', 'b c ']

So it isn't actually removed: once maxsplit is reached, the trailing whitespace stays in the last element.
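To make the asymmetry concrete, here is a short comparison on the same string. The variable names are only for illustration; the calls are ordinary str.split and str.strip.

```python
s = "   a   b c "

# Leading whitespace is skipped before the first field is read, but the
# remainder after the last allowed split is kept verbatim.
limited = s.split(maxsplit=1)                  # ['a', 'b c ']

# Stripping first shows what "removed" would have looked like.
stripped_first = s.strip().split(maxsplit=1)   # ['a', 'b c']

# With an explicit separator the leading space is an ordinary delimiter,
# so the first field is the empty string.
explicit = s.split(" ", maxsplit=1)            # ['', '  a   b c ']

print(limited, stripped_first, explicit)
```

In other words, with sep=None the whitespace in front of each field is consumed while scanning, whereas whatever follows the final split is copied as-is.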

CodePudding user response:

Since the Stack Overflow FAQ and this Stack Overflow blog post state that answering one's own question is encouraged, I am posting one possible interpretation (edited after seeing the answer provided by @juanpa-arrivillaga and reading the official docs): when sep=None, the special rule for str.split is as follows.

  1. Attempt to remove all leading whitespace.
  2. As a convention, runs of consecutive whitespace are regarded as a single separator, according to the official docs.
  3. Apply the split one at a time, using the convention of item 2, until maxsplit is reached.
  4. Disregard any empty strings in the resulting list.

An interesting observation is that even when maxsplit=0, Python still carries out item 1, as illustrated by the following example.
>>> s = '\n Hello\t World\nOpps\t '
>>> print(s.split(maxsplit = 0))
['Hello\t World\nOpps\t ']
>>> print(s.split(maxsplit = 1))
['Hello', 'World\nOpps\t ']
>>> print(s.split(maxsplit = 2))
['Hello', 'World', 'Opps\t ']
>>> print(s.split(maxsplit = 3))
['Hello', 'World', 'Opps']