Why are there empty items after re.split()?

I assume I misunderstand how re.split() works.

Here is a real and simple example.

>>> import re
>>> re.split('(abc)', 'abc')
['', 'abc', '']

I'm confused about the empty ('') first and last elements in the resulting list. This is the result I expected:

['abc']

That was a very simplified example. Here is something more complex.

>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', '[[one][two]]')
['', 'one', 'two', '']

Here the result I expected would be:

['one', 'two']

This third example with words before and after works as expected.

>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', 'zero [[one][two]] three')
['zero ', 'one', 'two', ' three']

My final goal is to split (tokenize) a string with a regex and get both the split parts and the separators (the regex matches) as results. That is why I can't handle this with re.findall().

CodePudding user response：

If you use capturing groups in the re.split expression, the splitting part (abc) is also returned in the output. This can be very useful, e.g. for tokenization tasks.

With a single capturing group, every second item in the return value is the captured split pattern; e.g. if (a.c) was the splitter and dabcdagce the splittee, you'd get ['d', 'abc', 'd', 'agc', 'e'].

In your first example, since the split expression matches the whole string, you get empty strings "on the sides": the (empty) text before the match and the (empty) text after it.
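To make the point above concrete, here is a minimal sketch comparing the same split with and without a capturing group. The variable names are only illustrative. The list-comprehension filter at the end is one possible way to get the output the question expected, under the assumption that empty strings carry no meaning for the caller.

```python
import re

# Without a capturing group only the text around the match is returned.
# Both sides of the match are empty here.
without_group = re.split('abc', 'abc')

# With a capturing group the matched separator is inserted between them.
with_group = re.split('(abc)', 'abc')

# Assumption: empty strings are not needed, so they are filtered out.
filtered = [part for part in with_group if part]

print(without_group)  # ['', '']
print(with_group)     # ['', 'abc', '']
print(filtered)       # ['abc']
```

Note that filtering like this gives up the fixed positions of the separators in the list, so it only fits cases where you do not need to tell tokens and separators apart by index.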

CodePudding user response：

My answer is based on an answer to a similar question.

The behavior is as specified in the docs:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

That way, separator components are always found at the same relative indices within the result list.
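The documentation illustrates this passage with a split on runs of non-word characters. The snippet below reproduces that call and then uses slicing to show the "same relative indices" property; the names tokens and separators are just illustrative.

```python
import re

parts = re.split(r'(\W+)', '...words, words...')
print(parts)  # ['', '...', 'words', ', ', 'words', '...', '']

# With exactly one capturing group, even indices hold the text between
# separators and odd indices hold the separators themselves.
tokens = parts[0::2]
separators = parts[1::2]

print(tokens)      # ['', 'words', 'words', '']
print(separators)  # ['...', ', ', '...']
```

Without the leading and trailing empty strings, the separators would sit at even indices in this case and at odd indices in others, so this slicing would not be reliable.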

The last sentence in particular describes why this behavior is useful.

In short: when using capturing groups, the user/developer can always identify the separators (the matches) in the resulting list, because they always appear at the same relative positions.

Assuming one capturing group, every second element in the result is the matched separator (the captured group).

If you have two capturing groups, as in my example, the relative positions change and you have to count in threes: index 0 is the split token, 1 is the first captured group, 2 is the second captured group, 3 is the next split token, and so on.
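The counting rule above can be sketched for any number of groups. The helper below is hypothetical (split_with_separators is not part of the re module); it assumes the pattern has at least one capturing group and that every group takes part in each match. It relies on the groups attribute of a compiled pattern, which gives the number of capturing groups.

```python
import re

def split_with_separators(pattern, text):
    """Hypothetical helper: return (tokens, separators) from re.split()."""
    regex = re.compile(pattern)
    stride = regex.groups + 1          # one token plus N captured groups
    parts = regex.split(text)
    tokens = parts[::stride]           # indices 0, stride, 2 * stride, ...
    separators = [
        tuple(parts[i + 1:i + stride])  # the groups that follow each token
        for i in range(0, len(parts) - 1, stride)
    ]
    return tokens, separators

pattern = r'\[\[(.+?)\]\[(.+?)\]\]'
print(split_with_separators(pattern, 'zero [[one][two]] three'))
# (['zero ', ' three'], [('one', 'two')])
print(split_with_separators(pattern, '[[one][two]]'))
# (['', ''], [('one', 'two')])
```

Because the empty strings at the edges are kept, the same slicing works whether or not the separator touches the start or end of the string.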