Using regex in python to tell a string contains comma separated list

Time:02-24

I am trying to use a simple regex pattern match in Python to tell whether an input string is a comma-separated sequence of lists. Example inputs are:

input = '[1]'
input = "['Yes']"
input = '["Yes"]'
input = '["Yes",1,"No"],["HIGH","MEDIUM","LOW"]'
input = "[1,2,3], ['High', 'Medium', 'Low']"

and so on. Matching the pattern against a single list works. For a single list I do the following:

import re
pattern = re.compile(r'^\[(((\".+\")|(\'.+\')|(\d+)),?)+\]$')
input = '["Yes", 12, "No"]'
print(pattern.match(input))
print(pattern.match(input).string)

and I get the desired output

<re.Match object; span=(0, 17), match='["Yes", 12, "No"]'>
["Yes", 12, "No"]

For a string containing multiple lists, I test a similar pattern:

import re
pattern = re.compile(r'^((\[(((\".+\")|(\'.+\')|(\d+)),?)+\]),?)+$')
input = "[1,2,3],['High','Medium','Low']"
print(pattern.match(input))
print(pattern.match(input).string)

This works okay and I get the below output.

<re.Match object; span=(0, 31), match="[1,2,3],['High','Medium','Low']">
[1,2,3],['High','Medium','Low']

However, if I want to find the individual lists using the regex findall method, it doesn't work. Note that the pattern below is the single-list pattern without the ^ (start of string) and $ (end of string) anchors.

import re
pattern = re.compile(r'\[(((\".+\")|(\'.+\')|(\d+)),?)+\]')
input = "[1,2,3],['High','Medium','Low']"
pattern.findall(input)

I get the output:

[('3', '3', '', '', '3'),
 ("'High','Medium','Low'",
  "'High','Medium','Low'",
  '',
  "'High','Medium','Low'",
  '')]

So the result does not contain the list [1,2,3] as a whole, only its last item '3'. Further, the match 'High','Medium','Low' is missing the opening '[' and the closing ']'.

Also, I am wondering if there is a better way to write this regex without using ast.literal_eval.

CodePudding user response:

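You could use r'(\[(((\".+\")|(\'.+\')|(\d+)),?)+\])' as the pattern, which wraps the whole list in an outer capture group, and then extract the desired matches (here s is the input string), i.e.

pattern = re.compile(r'(\[(((\".+\")|(\'.+\')|(\d+)),?)+\])')
# findall returns one tuple of groups per match; x[0] is the outer group, i.e. the whole list
matches = list(map(lambda x:x[0], pattern.findall(s)))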

Then matches will be ['[1,2,3]', "['High','Medium','Low']"]
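
As a variation on the above (a sketch, not part of the original answer): findall returns tuples only because the pattern contains capture groups. If the groups are made non-capturing with (?:...), findall returns each whole match directly and the map/lambda step is not needed. The variable name s and the sample input are taken from the question.

```python
import re

# Same alternatives as the pattern above, but with non-capturing groups (?:...),
# so findall() returns the whole match for each list instead of a tuple of groups.
pattern = re.compile(r'\[(?:(?:".+"|\'.+\'|\d+),?)+\]')

s = "[1,2,3],['High','Medium','Low']"
matches = pattern.findall(s)
print(matches)
```

Note that the greedy .+ inside the quotes can still run across list boundaries when two adjacent lists both contain quoted strings, so this only changes what findall returns, not what the pattern accepts.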

CodePudding user response:

I'm not sure I understood the whole idea, so sorry for any misunderstanding on my part.

I tried using the pattern (\[.+?\]),? with re.findall() and got the following output:

import re
pattern = re.compile(r'(\[.+?\]),?')
input = "[1,2,3], ['High', xyz'Medium','Low']"
>>> pattern.findall(input)
['[1,2,3]', "['High', xyz'Medium','Low']"]

Is this what you meant to get?

CodePudding user response:

This is what worked in the end:

import re
pattern = re.compile(r'(\[(((\"[^,\"\'\[\]]+\")|(\'[^,\"\'\[\]]+\')|(\d+)),?\s?)+\])')
input = "[1,2,3], [1,2,'Low'], ['High', 'Medium','Low']"
matches = list(map(lambda x:x[0], pattern.findall(input)))
matches

Output:

['[1,2,3]', "[1,2,'Low']", "['High', 'Medium','Low']"]

If one of the lists in the input contains a garbage value, that list is skipped. For example, if the input was:

input = "[1,2,3], [1,2,'Low'], ['High', xyz'Medium','Low']"

the output would only contain the valid lists:

['[1,2,3]', "[1,2,'Low']"]

I came up with this solution with the answer provided by @Konny.
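
If the goal is also to reject the whole input when any list is malformed (rather than silently skipping it), one possible approach is to validate the full string first and only then extract the lists. The following is a minimal sketch under that assumption; the names ITEM, LIST, WHOLE, ONE and split_lists are hypothetical, and the pattern is slightly stricter than the one above (at least one item per list, no trailing comma).

```python
import re

# One item: a quoted string without commas, quotes, or brackets, or an integer.
ITEM = r'(?:"[^,"\'\[\]]+"|\'[^,"\'\[\]]+\'|\d+)'
# One list: items separated by a comma and optional whitespace.
LIST = rf'\[{ITEM}(?:,\s*{ITEM})*\]'
# Whole input: lists separated by a comma and optional whitespace.
WHOLE = re.compile(rf'{LIST}(?:,\s*{LIST})*')
ONE = re.compile(LIST)

def split_lists(text):
    """Return the bracketed lists in text, or None if any part is invalid."""
    if WHOLE.fullmatch(text) is None:
        return None
    # No capture groups in LIST, so findall returns whole matches.
    return ONE.findall(text)

good = split_lists("[1,2,3], [1,2,'Low'], ['High', 'Medium','Low']")
bad = split_lists("[1,2,3], [1,2,'Low'], ['High', xyz'Medium','Low']")
print(good)
print(bad)
```

Building the pattern from smaller named pieces keeps each level (item, list, sequence of lists) readable, and excluding quotes and brackets from the quoted strings prevents a match from running across list boundaries.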
