Question: Please debug this logic so that it produces the expected output
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W)', word)
    word_list.extend(tmp)
print(word_list)
Output:
['Hello', 'there', '.', '']
Problem: the result should not include the trailing empty element.
Expected: ['Hello', 'there', '.']
CodePudding user response:
First of all, the output you shared is not what the earlier code produces; it gives ['Hello', ' ', 'there', '.', ''].
This is because \W matches anything other than a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_]), so it splits your string on the space (\s) and on the literal dot (.).
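To see which characters \W actually picks out of the sample string, a minimal sketch using re.findall (the variable name non_word is only illustrative):

```python
import re

s = "Hello there."

# \W matches any single character that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", s)
print(non_word)  # [' ', '.']
```

These two characters are the separators that re.split uses, which is why both the space and the dot are involved in the result.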
So to get the expected output, you need some further processing, such as the following.
With Earlier Code:
import re
s = "Hello there."
# str.strip returns '' for empty and whitespace-only items, so filter drops them
l = list(filter(str.strip, re.split(r"(\W)", s)))
print(l)
With Edited code:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W)', word)
    word_list.extend(tmp)
# filter(None, ...) drops falsy items, here the trailing empty string
print(list(filter(None, word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243
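As an alternative to splitting and then filtering, the tokens can be matched directly with re.findall. This is a different technique from the one above, offered only as a sketch; the pattern and the name tokens are my own choices and assume that every punctuation character should become its own token.

```python
import re

text = "Hello there."

# \w+ grabs runs of word characters; [^\w\s] grabs each single
# punctuation character. Whitespace matches neither, so it is skipped.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Hello', 'there', '.']
```

Because nothing is split, no empty strings are produced and no filtering step is needed.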
CodePudding user response:
Assuming word is "Hello there.", the results make sense. See the re.split documentation: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
You have put capturing parentheses in the pattern, so you are splitting the string on non-word characters and also returning the characters used for splitting.
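The effect of the capturing parentheses is easiest to see side by side. A small sketch (the variable names are illustrative):

```python
import re

s = "Hello there."

without_group = re.split(r"\W", s)   # separators are discarded
with_group = re.split(r"(\W)", s)    # separators are kept in the result

print(without_group)  # ['Hello', 'there', '']
print(with_group)     # ['Hello', ' ', 'there', '.', '']
```

In both cases the trailing empty string is present; the group only controls whether the separators appear.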
Here is the string: Hello there.
Here is how it is split: Hello|there|
That means you have three values: Hello, there, and an empty string '' in the last place.
The values you split on are a space and a period.
So the output should be the three values and the two characters that we split on:
Hello - space - there - period - empty string
which is exactly what I get.
import re
s = "Hello there."
t = re.split(r"(\W)", s)
print(t)
output:
['Hello', ' ', 'there', '.', '']
Further Explanation
From your question, it may be that you expected nothing to come "after" the final non-word character because the string ends with it, but this is not how splitting works. Think back to CSV files (which have been around forever) and consider a file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the second data row the last field (which definitely exists) is empty. It would be parsed as an empty string if we split on the comma.
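The same behaviour can be checked with plain str.split on the rows above. This is only a sketch of the analogy, not a real CSV parser, and the names rows and parsed are illustrative:

```python
rows = [
    "date,product,qty,price",
    "20220821,P1,10,20.00",
    "20220821,P2,10,",
]

# Every row splits into four fields; a trailing comma yields an empty last field.
parsed = [row.split(",") for row in rows]
for fields in parsed:
    print(len(fields), fields)
```

The last row still has four fields, and the final one is the empty string, just like the trailing '' that re.split returns for "Hello there.".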