How to split by newline and ignore blank lines using regex?

Time:06-17

Let's say I have this data:

data = '''a, b, c
d, e, f
g. h, i
  
j, k , l


'''

The 4th line contains a single space. The 6th and 7th lines contain nothing at all; they are just blank lines.

When I split it using splitlines():

data.splitlines()

I get

['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '']

However, the expected output is just

['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

Is there a simple way to do this using regular expressions?

Please note that I already know the other way of doing it: filtering empty strings out of the output of splitlines().

I am not sure whether the same can be achieved using regex.

When I use regex to split on newlines, I get

import re
re.split("\n", data)

Output :

['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '', '']

CodePudding user response:

I disagree with your assessment that filtering is more complicated than using regular expressions. However, if you really want to use regex, you could split at multiple consecutive newlines like so:

>>> re.split(r"\n+", data)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l', '']

Unfortunately, this leaves an empty string at the end of your list. To get around this, use re.findall to find everything that isn't a newline:

>>> re.findall(r"([^\n]+)", data)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
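
To check both patterns against the exact data from the question, here is a minimal sketch. It assumes the two patterns are `\n+` and `[^\n]+`, and it writes the question's data as an escaped string so the whitespace-only line is visible. With that line present, both patterns keep it as `' '`:

```python
import re

# The question's data, written with explicit escapes.
# The 4th line holds a single space.
data = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n"

# Splitting on runs of newlines: the space-only line survives,
# and the trailing newlines leave one empty string at the end.
split_result = re.split(r"\n+", data)
print(split_result)
# ['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '']

# Matching runs of non-newline characters: no trailing empty string,
# but the space-only line is still returned.
findall_result = re.findall(r"[^\n]+", data)
print(findall_result)
# ['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l']
```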

Neither of those patterns filters out lines that contain only spaces or tabs, so here's one that does:

>>> re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

Here's the explanation:

^([ \t]*\S.*)$
^            $   : Start of line and end of line
 (          )    : Capturing group
  [ \t]*         : Zero or more spaces or tabs (i.e. whitespace that isn't a newline)
        \S       : One non-whitespace character
          .*     : Zero or more of any character
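
The same breakdown can be written directly into the pattern with `re.VERBOSE`, as in the sketch below. The name `NONBLANK_LINE` and the `sample` string are made up for illustration. The sample shows two side effects of this pattern: leading indentation is kept, and so is trailing whitespace, because `.*` runs to the end of the line.

```python
import re

# Hypothetical name; the pattern is the one explained above, spread out
# with re.VERBOSE so each piece carries its own comment.
NONBLANK_LINE = re.compile(
    r"""
    ^            # start of a line (because of re.MULTILINE)
    (            # capturing group: the text findall returns
      [ \t]*     # optional leading spaces or tabs
      \S         # one non-whitespace character, so the line is not blank
      .*         # the rest of the line, trailing whitespace included
    )
    $            # end of the line
    """,
    re.MULTILINE | re.VERBOSE,
)

# Illustrative input: an indented line, a tab-only line, an empty line,
# and a line with trailing spaces.
sample = "  x = 1\n\t\n\ny = 2  \n"

result = NONBLANK_LINE.findall(sample)
print(result)  # ['  x = 1', 'y = 2  ']
```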
            

CodePudding user response:

List comprehension approach

You can keep only the lines that are neither empty nor whitespace-only by adding a condition to a list comprehension.

If a line is still truthy after stripping its whitespace, it is not an empty string, so it is added to the list.

filtered_data = [el for el in data.splitlines() if el.strip()]
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
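
As a small illustration of why `el.strip()` works as the condition, the sketch below prints the truthiness of each stripped line. The shortened `data` string is an assumption made for brevity, not the data from the question.

```python
# Shortened, illustrative data: one space-only line and one empty line.
data = "a, b, c\n \n\nd, e, f\n"

# An empty string is falsy, and a whitespace-only string becomes
# empty (hence falsy) after strip().
for line in data.splitlines():
    print(repr(line), "->", bool(line.strip()))

# Only the lines whose stripped form is non-empty are kept, unmodified.
kept = [line for line in data.splitlines() if line.strip()]
print(kept)  # ['a, b, c', 'd, e, f']
```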

Regexp approach

import re
p = re.compile(r"^([^\s]+.+)", re.M)
p.findall(data)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
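
One caveat worth checking, assuming the pattern in this answer is meant to be `^([^\s]+.+)`: it requires a line to start with a non-whitespace character and to be at least two characters long. The sketch below uses a made-up `text` value to compare it with the list comprehension, which has neither restriction.

```python
import re

# Assumed form of the pattern from this answer.
p = re.compile(r"^([^\s]+.+)", re.M)

# Illustrative input: a normal line, an indented line,
# a one-character line, and a blank line.
text = "a, b\n  indented\nx\n\n"

regex_lines = p.findall(text)
filtered_lines = [line for line in text.splitlines() if line.strip()]

print(regex_lines)     # ['a, b']
print(filtered_lines)  # ['a, b', '  indented', 'x']
```

For data like the question's, where every non-blank line starts at column zero and has several characters, both approaches give the same result.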