I'm trying to parse a comma-separated list with multiple capture groups in each element using a regex.
Sample Text
col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37
I've tried using various variants of this regex
(.*?)\s?=\s?(.*?)\s?,?
But it never gives me what I want, or if it gets close, it can't cope with there being just one element, or vice versa.
What I'm expecting is a list of Matches with 3 groups
Match1 group 0 the whole match
Match1 group 1 col1
Match1 group 2 'Test String'
Match2 group 0 the whole match
Match2 group 1 col2
Match2 group 2 'Next Test String'
Match3 group 0 the whole match
Match3 group 1 col3
Match3 group 2 'Last Text String'
Match4 group 0 the whole match
Match4 group 1 col4
Match4 group 2 37
(Note I'm only interested in groups 1 & 2)
I'm deliberately keeping this language-agnostic, as I can't get it to work in online regex debuggers; however, my target language is Python 3.
Thank you in advance, and I hope I've made myself clear.
CodePudding user response:
The (.*?)\s?=\s?(.*?)\s?,? regex has only one obligatory pattern, the = char. The (.*?) at the start expands up to the leftmost =, so the group captures any text before the leftmost = (apart from one optional whitespace right before it, which \s? takes). The rest of the subpatterns do not have to match: if there is a whitespace after the =, it is matched with \s?; if there are two, both are matched (one by each \s?); and if there is a comma, it is also matched and consumed. The second (.*?) is simply skipped, as it is lazy and nothing after it is required.
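To see this behaviour concretely, here is a minimal sketch that runs the question's pattern, unchanged, over the sample text with Python's re module. The variable names are only illustrative.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# The pattern from the question, unchanged
original = re.compile(r"(.*?)\s?=\s?(.*?)\s?,?")

# findall returns one (group 1, group 2) tuple per match
results = original.findall(text)
for key, value in results:
    print(repr(key), repr(value))
```

Group 2 comes back empty for every match, and from the second match onwards group 1 also swallows the previous element's value, because nothing stops the lazy (.*?) until the next = is reached.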
If you want to get the second capturing group with single quotes included, you can use
(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)
This pattern matches:
(?:,|^) - a non-capturing group matching a , or start of string
\s* - zero or more whitespaces
([^\s=]+) - Group 1: one or more chars other than whitespace and =
\s*=\s* - a = char enclosed with zero or more whitespaces
('[^']*'|\S+) - Group 2: either a ', zero or more non-' chars, and a ', or one or more non-whitespaces.
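The breakdown above can be sketched in Python as follows. This is only an illustration: the name PAIR is made up here, and re.VERBOSE is used so each part of the pattern can carry its own comment.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# Same pattern as above, split across lines with re.VERBOSE
PAIR = re.compile(
    r"""
    (?:,|^)          # a comma or the start of the string
    \s*              # zero or more whitespaces
    ([^\s=]+)        # group 1: the key
    \s*=\s*          # an = char with optional whitespace on either side
    ('[^']*'|\S+)    # group 2: a single-quoted value or a run of non-whitespace
    """,
    re.VERBOSE,
)

# Collect (group 1, group 2) for every element in the list
pairs = [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]
for key, value in pairs:
    print(key, value)
```

With this pattern group 2 still includes the single quotes around quoted values.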
If you want to exclude single quotes you can post-process the matches, or use an extra capturing group in '([^']*)', and then check if the group matched or not:
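import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Group 1: key; group 2: quoted value without the quotes; group 3: unquoted value
pattern = r"([^,\s=]+)\s*=\s*(?:'([^']*)'|(\S+))"
matches = re.findall(pattern, text)
# Only one of groups 2 and 3 is non-empty per match, so pick whichever matched
print({key: unquoted or quoted for key, quoted, unquoted in matches})
# => {'col1': 'Test String', 'col2': 'Next Test String', 'col3': 'Last Text String', 'col4': '37'}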
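The other option mentioned, post-processing the matches, could look roughly like the sketch below. The helper name parse_pairs is hypothetical, and it reuses the two-group pattern that keeps the quotes in group 2.

```python
import re

# Two-group pattern: group 2 keeps the surrounding single quotes
PAIR = re.compile(r"(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)")

def parse_pairs(text):
    """Return a dict of key -> value, with surrounding single quotes removed."""
    result = {}
    for key, value in PAIR.findall(text):
        # Strip the quotes only when the value is wrapped in a matching pair
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]
        result[key] = value
    return result

sample = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
print(parse_pairs(sample))
print(parse_pairs("col4=37"))  # a single element works too
```

One caveat with this sketch: an unquoted value followed directly by a comma, with no space in between, would include that comma, since \S+ matches it. Using [^\s,]+ for the unquoted alternative is one possible tweak if such input is expected.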
If you want to do that with a pure regex, you can use a branch reset group:
import regex # pip install regex
text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
print( dict(regex.findall(r"([^,\s=]+)\s*=\s*(?|'([^']*)'|(\S+))", text)) )