Not getting expected output for some reason?

Question: Please help me debug this logic so that it produces the expected output.

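import re

text = "Hello there."
word_list = []

# Split each whitespace-separated word on runs of non-word characters,
# keeping the separators because of the capturing group.
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)

print(word_list)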

Output:

['Hello', 'there', '.', '']

Problem: I need the result without the empty string at the end.

Expected: ['Hello', 'there', '.']

CodePudding user response：

First of all, the output you shared is not what the earlier code (re.split applied to the whole string) actually produces. It is ['Hello', ' ', 'there', '.', ''], because:

\W matches any character other than a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_] for ASCII text), so it splits your string on both the space and the literal dot (.).
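As a quick check of that claim, the following minimal sketch lists the characters in the sample string that \W matches. The variable name non_word is only illustrative.

```python
import re

text = "Hello there."

# \W matches each character that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", text)
print(non_word)  # [' ', '.']
```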

So if you want to get the expected output, you need to do some further processing, like the below:

With the earlier code:

import re
s = "Hello there."
# str.strip returns '' for empty or whitespace-only pieces, so filter drops them
l = list(filter(str.strip, re.split(r"(\W+)", s)))
print(l)

With the edited code:

import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
# filter(None, ...) drops falsy items, here the trailing empty string
print(list(filter(None, word_list)))

Output:

['Hello', 'there', '.']

Working Code: https://rextester.com/KWJN38243
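As an alternative that is not part of the answer above, one could match the tokens directly instead of splitting and then filtering. The sketch below assumes that a token is either a run of word characters or a single punctuation mark; the tokenize helper is a hypothetical name used only for illustration.

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello there."))  # ['Hello', 'there', '.']
```

Because re.findall only returns what the pattern matches, no empty strings or spaces appear in the result, so no filtering step is needed.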

CodePudding user response：

Assuming word is "Hello there.", the results make sense. See the split function documentation: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

You have put capturing parentheses in the pattern, so you are splitting the string on non-word characters and also returning the characters used for splitting.

Here is the string: Hello there.

Here is how it is split: Hello|there|

That means you have three values: 'Hello', 'there', and an empty string '' in the last place.

The values you split on are a space and a period.

So the output should be the three values plus the two separators that we split on:

Hello - space - there - period - empty string

which is exactly what I get.

import re

s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)

Output: ['Hello', ' ', 'there', '.', '']
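For comparison, here is a small sketch of the same split with and without the capturing parentheses, assuming the intended pattern is (\W+). The variable names with_group and without_group are only illustrative.

```python
import re

s = "Hello there."

# Capturing group: the separators are returned along with the pieces.
with_group = re.split(r"(\W+)", s)
# No group: only the pieces between separators are returned.
without_group = re.split(r"\W+", s)

print(with_group)     # ['Hello', ' ', 'there', '.', '']
print(without_group)  # ['Hello', 'there', '']
```

In both cases the trailing empty string remains, because the string ends with a separator.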

Further Explanation

From your question, it may be that you think that, because the string ends with a non-word character, there would be nothing "after" it, but this is not how splitting works. Think of CSV files (which have been around forever) and consider one like this:

date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,

The above represents a CSV file with four fields, but in the second data row the last field (which definitely exists) is empty. It would be parsed as an empty string if we split on the comma.
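The same behavior can be seen with a plain string split on that last row. This is only a minimal sketch of the analogy, not a recommendation for parsing real CSV files.

```python
row = "20220821,P2,10,"

# The trailing comma still delimits a fourth field, which is empty.
fields = row.split(",")
print(fields)       # ['20220821', 'P2', '10', '']
print(len(fields))  # 4
```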