Let's say I have this data:
data = '''a, b, c
d, e, f
g. h, i
 
j, k , l


'''
The 4th line contains a single space; the 6th and 7th lines contain no spaces, just a bare newline.
Now when I split it using splitlines:
data.splitlines()
I get
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '']
However, what I expected was just
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Is there a simple solution using regular expressions to do this?
Please note that I know the other way of doing it: filtering empty strings from the output of splitlines().
I am not sure whether the same can be achieved using regex.
When I use regex to split on newlines, it gives me
import re
re.split("\n", data)
Output :
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '', '']
CodePudding user response:
I disagree with your assessment that filtering is more complicated than using regular expressions. However, if you really want to use regex, you could split at multiple consecutive newlines like so:
>>> re.split(r"\n+", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '']
Unfortunately, this leaves an empty string at the end of your list.
To get around this, use re.findall to find everything that isn't a newline:
>>> re.findall(r"([^\n]+)", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l']
Neither of those regexes drops a line that contains only spaces or tabs, so here's one that does:
>>> re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Here's the explanation:
^([ \t]*\S.*)$
^ $ : Start of line and end of line
( ) : Capturing group
[ \t]* : Zero or more spaces or tabs (i.e. whitespace that isn't a newline)
\S : One non-whitespace character
.* : Zero or more of any character
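To see how the three patterns in this answer differ, here is a minimal sketch that runs them side by side. It assumes the question's data is exactly the string below (written with explicit escapes so the space-only line and the trailing blank lines are visible); the variable names are only for illustration.

```python
import re

# The question's data, with the whitespace-only 4th line and the two
# trailing blank lines spelled out as escapes (an assumption about the input).
data = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n"

# Splitting on runs of newlines keeps the ' ' line and a trailing ''.
split_result = re.split(r"\n+", data)

# Matching runs of non-newline characters drops '' but still keeps ' '.
nonempty_result = re.findall(r"[^\n]+", data)

# Requiring at least one non-whitespace character per line drops both.
nonblank_result = re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)

print(split_result)
print(nonempty_result)
print(nonblank_result)
```

Only the last pattern looks at what a line contains rather than just where the newlines are, which is why it is the only one that skips the space-only line.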
CodePudding user response:
List comprehension approach
You can add elements to your list if they are not empty strings or whitespace ones with a condition check.
If a line is still truthy after stripping whitespace, it is not an empty string, so it is added to the list.
filtered_data = [el for el in data.splitlines() if el.strip()]
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Regexp approach
import re
p = re.compile(r"^([^\s]+.+)", re.M)
p.findall(data)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
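The two approaches agree on the question's data, but probably not on every input. The sketch below is a rough comparison, assuming the regex is meant to be `^([^\s]+.+)` (my reading of the pattern above) and using two hypothetical helper names, `nonblank_by_filter` and `nonblank_by_regex`, plus a made-up `sample` string.

```python
import re

def nonblank_by_filter(text):
    # Keep every line that still has content after stripping whitespace.
    return [line for line in text.splitlines() if line.strip()]

def nonblank_by_regex(text):
    # Assumed pattern: the line must start with non-whitespace characters
    # and be followed by at least one more character.
    return re.findall(r"^([^\s]+.+)", text, re.MULTILINE)

# The question's data, written with explicit escapes.
data = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n"

# Hypothetical input with an indented line and a one-character line.
sample = "ab\n  indented\nx\n \n\n"

print(nonblank_by_filter(data) == nonblank_by_regex(data))
print(nonblank_by_filter(sample))
print(nonblank_by_regex(sample))
```

On `sample`, the regex version skips the indented line (it requires a non-whitespace first character) and the single-character line (it needs at least two characters), while the list comprehension keeps both. If those cases matter for your input, the filter is the safer choice.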