Regex handling multiple groups from a potentially comma-delimited list

Time:12-30

I'm trying to parse a comma separated list with multiple capture groups in each element via regex.

Sample Text

col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37

I've tried using various variants of this regex

(.*?)\s?=\s?(.*?)\s?,?

But it never gives me what I want: if it gets close with several elements, it can't cope with there being just one element, or vice versa.

What I'm expecting is a list of Matches with 3 groups

Match1 group 0 the whole match
Match1 group 1 col1
Match1 group 2 'Test String'
Match2 group 0 the whole match
Match2 group 1 col2
Match2 group 2 'Next Test String'
Match3 group 0 the whole match
Match3 group 1 col3
Match3 group 2 'Last Text String'
Match4 group 0 the whole match
Match4 group 1 col4
Match4 group 2 37

(Note I'm only interested in groups 1 & 2)

I'm deliberately making this non-language-specific, as I can't get it to work in online regex debuggers. However, my target language is Python 3.

Thank you in advance and I hope I've made myself clear

CodePudding user response:

The (.*?)\s?=\s?(.*?)\s?,? regex has only one obligatory pattern, =. The (.*?) at the start expands lazily up to the leftmost =, so Group 1 captures any text before it, minus an optional whitespace that \s? consumes. The rest of the subpatterns do not have to match: if a whitespace follows the =, it is matched with \s?; if there are two, the second \s? matches the other one; and if a comma comes right after, it is matched and consumed as well. The second (.*?) is lazy and nothing after it forces it to expand, so it always matches an empty string.
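To see that behaviour concretely, here is a small sketch that runs the pattern from the question against the sample text with Python's standard re module. The variable names are only illustrative.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# The pattern from the question: only "=" is mandatory, everything else is optional.
original = r"(.*?)\s?=\s?(.*?)\s?,?"

pairs = re.findall(original, text)
for key, value in pairs:
    # The second group is always empty because the lazy (.*?) never has to expand.
    print(repr(key), repr(value))
```

Only the first key comes out as col1. Every later match starts right after the previous =, so its first group swallows the preceding value together with the next key, and the trailing 37 is never captured at all.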

If you want to get the second capturing group with single quotes included, you can use

(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)

The pattern matches:

  • (?:,|^) - a non-capturing group matching a , or start of string
  • \s* - zero or more whitespaces
  • ([^\s=]+) - Group 1: one or more chars other than whitespace and =
  • \s*=\s* - a = char enclosed with zero or more whitespaces
  • ('[^']*'|\S+) - Group 2: either ', zero or more non-'s, and a ', or one or more non-whitespaces.
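As a quick check of the breakdown above, the following sketch applies the pattern with re.finditer and collects Groups 1 and 2 of every match. It also tries a single-element input, which was the case the question struggled with. The names pairs and single are just for illustration.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# Each match starts at a comma or at the start of the string.
pattern = re.compile(r"(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)")

# Group 1 is the key, Group 2 is the value (quotes included for quoted values).
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(pairs)

# A list with just one element is handled by the same pattern.
single = pattern.findall("col4=37")
print(single)
```

One limitation worth keeping in mind: \S+ also consumes a comma that directly follows an unquoted value, so an input like a=1,b=2 would need a narrower class for unquoted values.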

If you want to exclude single quotes you can post-process the matches, or use an extra capturing group in '([^']*)', and then check if the group matched or not:

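import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Group 1: key, Group 2: quoted value without the quotes, Group 3: unquoted value
pattern = r"([^,\s=]+)\s*=\s*(?:'([^']*)'|(\S+))"
matches = re.findall(pattern, text)
# Only one of Groups 2 and 3 takes part in each match; the other is an empty string
print({key: bare or quoted for key, quoted, bare in matches})
# => {'col1': 'Test String', 'col2': 'Next Test String', 'col3': 'Last Text String', 'col4': '37'}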
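The other option mentioned above, post-processing the matches, could look roughly like this. It keeps the two-group pattern and strips the quotes afterwards; unquote is a hypothetical helper written for this sketch, not part of the re module.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# Two groups only: the key, and the value with its quotes still attached.
pattern = re.compile(r"([^,\s=]+)\s*=\s*('[^']*'|\S+)")

def unquote(value):
    """Drop one pair of surrounding single quotes, if present (hypothetical helper)."""
    if len(value) >= 2 and value[0] == value[-1] == "'":
        return value[1:-1]
    return value

result = {key: unquote(value) for key, value in pattern.findall(text)}
print(result)
```

This keeps the regex simpler at the cost of one extra step in Python, which may be easier to extend later, for example to other quote styles.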

If you want to do that with a pure regex, you can use a branch reset group:

import regex  # pip install regex

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Inside (?|...) both alternatives number their capturing group as Group 2
print( dict(regex.findall(r"([^,\s=]+)\s*=\s*(?|'([^']*)'|(\S+))", text)) )
