Using regex to remove the unnecessary whitespace to get the expected output

Time:11-10

I have some unstructured data like this:

test1     21;
 test2  22;
test3    [ 23 ];

I want to remove the unnecessary whitespace and convert each row into a two-item list. The expected output should look like this:

['test1', '21']
['test2', '22']
['test3', ['23']]

Currently I am using re.sub to collapse the unnecessary whitespace:

re.sub(r"\s+", " ", z.rstrip('\n').lstrip(' ').rstrip(';')).split(' ')

This collapses each run of whitespace into a single space, which is fine. The problem is the third example: there is whitespace after the opening bracket and before the closing bracket, and I want to remove it. With the regex above, the brackets end up as separate list items instead.

This is the output I am currently getting:

['test1', '21']
['test2', '22']
['test3', '[', '23', ']']

You may check the example here on pythontutor.
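A minimal sketch that reproduces the behavior described above, assuming each input row arrives as its own string (the names `lines` and `split_line` are illustrative and not part of the original code):

```python
import re

# Sample rows from the question, one string per row (assumed input shape)
lines = ["test1     21;", " test2  22;", "test3    [ 23 ];"]

def split_line(z):
    # Collapse whitespace runs, trim the ends and the trailing ';',
    # then split on single spaces
    return re.sub(r"\s+", " ", z.rstrip('\n').lstrip(' ').rstrip(';')).split(' ')

rows = [split_line(z) for z in lines]
for row in rows:
    print(row)
```

Because the split happens on every space, the brackets in the third row become items of their own rather than delimiters of a nested list.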

CodePudding user response:

You may use this regex with 2 capture groups:

(\w+)\s+(\[[^]]+\]|\w+);

RegEx Demo

RegEx Details:

  • (\w+): Match 1+ word characters in the first capture group
  • \s+: Match 1+ whitespace characters
  • (\[[^]]+\]|\w+): Match a [...] string or a word in the second capture group
  • ;: Match a ;

Code:

>>> import re
>>> data = '''
... test1     21;
...  test2  22;
... test3    [ 23 ];
... '''
>>> res = []
>>>
>>> for i in re.findall(r'(\w+)\s+(\[[^]]+\]|\w+);', data):
...     res.append([ i[0], eval(re.sub(r'^(\[)\s*|\s*(\])$', r'\1"\2', i[1])) if i[1].startswith('[') else i[1] ])
...
>>> print (res)
[['test1', '21'], ['test2', '22'], ['test3', ['23']]]
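As a variation on the loop above, the bracketed value can be turned into a nested list without `eval`, by collecting the words inside the brackets with a second `re.findall`. This is only a sketch: the names `pattern` and `parse` are illustrative, and it assumes bracketed values contain nothing but word characters and whitespace.

```python
import re

data = """
test1     21;
 test2  22;
test3    [ 23 ];
"""

# Same two-group pattern as in the answer above
pattern = re.compile(r'(\w+)\s+(\[[^]]+\]|\w+);')

def parse(text):
    rows = []
    for key, value in pattern.findall(text):
        if value.startswith('['):
            # Words inside the brackets become a nested list
            value = re.findall(r'\w+', value)
        rows.append([key, value])
    return rows

result = parse(data)
print(result)
```

Avoiding `eval` keeps the parser from executing anything that happens to appear in the input.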

CodePudding user response:

You can use

import re, ast
s="""test1     21;
 test2  22;
test3    [ 23 ];"""
output = [ast.literal_eval("[" + re.sub(r'\s*,\s*(?=])', '', re.sub(r"\w+", r"'\g<0>',", " ".join(x.split())).strip(',;')) + "]") for x in s.split('\n')]
print(output)
# => [['test1', '21'], ['test2', '22'], ['test3', ['23']]]

See the Python demo.

Details:

  • " ".join(x.split()) - normalizes whitespace to single spaces between words
  • re.sub(r"\w+", r"'\g<0>',", ...).strip(',;') - wraps each word in single quotes and appends a comma after it, then strips commas and semicolons from both ends
  • re.sub(r'\s*,\s*(?=])', '', ...) - removes commas, together with any surrounding whitespace, that are immediately followed by a ] char
  • "[" + ... + "]" - wraps the previous result in square brackets
  • ast.literal_eval(...) - parses the resulting string into a Python list
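The steps listed above can be unrolled into separate statements to make each intermediate string visible. This is a sketch of the same one-liner; the function name `parse_line` and the variable names are illustrative, and the comments show the intermediate value for the third row.

```python
import re, ast

def parse_line(x):
    # Normalize whitespace: "test3 [ 23 ];"
    normalized = " ".join(x.split())
    # Quote every word and append a comma: "'test3', [ '23', ];"
    quoted = re.sub(r"\w+", r"'\g<0>',", normalized)
    # Strip commas/semicolons from both ends: "'test3', [ '23', ]"
    stripped = quoted.strip(',;')
    # Drop the comma that precedes a closing bracket: "'test3', [ '23']"
    cleaned = re.sub(r'\s*,\s*(?=])', '', stripped)
    # Wrap in brackets and parse as a Python literal
    return ast.literal_eval("[" + cleaned + "]")

s = """test1     21;
 test2  22;
test3    [ 23 ];"""

output = [parse_line(x) for x in s.split('\n')]
print(output)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it raises an error on anything else instead of executing it.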