I want to extract some data using regex. I'm currently using Python, and the data can look like this:
text [text1 & text2]
text [text1]
text [text2]
text
There is a "text" value (one or more words) that may or may not be followed by []. If [] is present, the text inside can be two words separated by &, or just one of them. (At the moment only two possible words exist, but I think it is better to get the words by separator than to match the exact words.)
So my goal is to get the data in two different groups:
- Group1: text
- Group2: [text1, text2] (if exists)
My regex is:
(^([^\[])+)(.+?(?=&|\]))?(.+?(?=\]))?
I've tried this using findall which I think fits better with the result I want:
import re
regex = r"(^([^\[])+)(.+?(?=&|\]))?(.+?(?=\]))?"
result = re.findall(regex, text)
But it returns:
[('text ', ' ', '[text1 ', '& text2')]
And I want something like:
[('text'), ('text1', 'text2')]
I've tried in this playground
How can I improve my regex pattern?
Thanks in advance.
CodePudding user response:
Try this regex:
(?x)                # Extended (whitespace and comments are ignored)
([^[]+)             # Group 1: 1 or more non-'[' characters
(?:                 # Start of non-capturing group 1
    \s*             # Match optional whitespace
    \[              # Match '['
    ([^&\]]+)       # Group 2: 1 or more non-'&' or ']' characters
    (?:             # Start of non-capturing group 2
        \s*         # Match optional whitespace
        &           # Match '&'
        \s*         # Match optional whitespace
        ([^\]]+)    # Group 3: 1 or more non-']' characters
        \s*         # Match optional whitespace
    )*              # End of optional non-capturing group 2
    ]               # Match ']'
)*                  # End of optional non-capturing group 1
import re

tests = [
    'text [text1 & text2]',
    'text [text1]',
    'text [text2]',
    'text'
]

regex = r"""(?x)        # Extended (whitespace and comments are ignored)
    ([^[]+)             # Group 1: 1 or more non-'[' characters
    (?:                 # Start of non-capturing group 1
        \s*             # Match optional whitespace
        \[              # Match '['
        ([^&\]]+)       # Group 2: 1 or more non-'&' or ']' characters
        (?:             # Start of non-capturing group 2
            \s*         # Match optional whitespace
            &           # Match '&'
            \s*         # Match optional whitespace
            ([^\]]+)    # Group 3: 1 or more non-']' characters
            \s*         # Match optional whitespace
        )*              # End of optional non-capturing group 2
        ]               # Match ']'
    )*                  # End of optional non-capturing group 1
"""

for test in tests:
    # Here I am doing a fullmatch, but you can do a search to be more lenient:
    m = re.fullmatch(regex, test)
    if m:
        results = m[1], m[2], m[3]
        print(test, '->', results)
Prints:
text [text1 & text2] -> ('text ', 'text1 ', 'text2')
text [text1] -> ('text ', 'text1', None)
text [text2] -> ('text ', 'text2', None)
text -> ('text', None, None)
You can, of course, group the results any way you want, such as:
results = (m[1], (m[2] or '', m[3] or ''))
Prints:
text [text1 & text2] -> ('text ', ('text1 ', 'text2'))
text [text1] -> ('text ', ('text1', ''))
text [text2] -> ('text ', ('text2', ''))
text -> ('text', ('', ''))
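If the trailing spaces in these groups are unwanted, one option is to post-process the match: strip each captured group and drop the ones that did not take part in the match. The following is only a minimal sketch, assuming the same pattern as above written in compact form; the names `PATTERN` and `parse` are made up for illustration.

```python
import re

# Same pattern as above, in compact form (brackets escaped for clarity).
PATTERN = re.compile(r"([^\[]+)(?:\s*\[([^&\]]+)(?:\s*&\s*([^\]]+)\s*)*\])*")

def parse(line):
    """Return (text, (item, ...)) with whitespace stripped, or None if no match."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    # Groups 2 and 3 are None when the bracketed part (or the '&' part) is absent.
    items = tuple(g.strip() for g in (m[2], m[3]) if g is not None)
    return m[1].strip(), items

for line in ['text [text1 & text2]', 'text [text1]', 'text']:
    print(line, '->', parse(line))
```

Returning a tuple that simply omits missing items avoids having to pick a placeholder such as '' or None for them.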
Update
This regex will do a better job of automatically trimming whitespace if you know there is always whitespace before the '[' and '&' characters:
(?x)                # Extended (whitespace and comments are ignored)
([^[]+)             # Group 1: 1 or more non-'[' characters
(?:                 # Start of non-capturing group 1
    \s              # Match whitespace
    \[              # Match '['
    ([^&\]]+)       # Group 2: 1 or more non-'&' or ']' characters
    (?:             # Start of non-capturing group 2
        \s          # Match whitespace
        &           # Match '&'
        \s*         # Match optional whitespace
        ([^\]]+)    # Group 3: 1 or more non-']' characters
        \s*         # Match optional whitespace
    )*              # End of optional non-capturing group 2
    ]               # Match ']'
)*                  # End of optional non-capturing group 1
Code:
import re

tests = [
    'text [text1 & text2]',
    'text [text1]',
    'text [text2]',
    'text'
]

regex = r"""(?x)        # Extended (whitespace and comments are ignored)
    ([^[]+)             # Group 1: 1 or more non-'[' characters
    (?:                 # Start of non-capturing group 1
        \s              # Match whitespace
        \[              # Match '['
        ([^&\]]+)       # Group 2: 1 or more non-'&' or ']' characters
        (?:             # Start of non-capturing group 2
            \s          # Match whitespace
            &           # Match '&'
            \s*         # Match optional whitespace
            ([^\]]+)    # Group 3: 1 or more non-']' characters
            \s*         # Match optional whitespace
        )*              # End of optional non-capturing group 2
        ]               # Match ']'
    )*                  # End of optional non-capturing group 1
"""

for test in tests:
    # Here I am doing a fullmatch, but you can do a search to be more lenient:
    m = re.fullmatch(regex, test)
    if m:
        results = (m[1], (m[2] or '', m[3] or ''))
        print(test, '->', results)
Prints:
text [text1 & text2] -> ('text', ('text1', 'text2'))
text [text1] -> ('text', ('text1', ''))
text [text2] -> ('text', ('text2', ''))
text -> ('text', ('', ''))
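One thing worth checking before swapping fullmatch for search with this updated pattern: the trimming relies on the match having to reach the end of the string. A small sketch, assuming the updated pattern in compact form (the variable names are only for illustration), shows the difference:

```python
import re

# Updated pattern from above, in compact form (brackets escaped for clarity).
pattern = re.compile(r"([^\[]+)(?:\s\[([^&\]]+)(?:\s&\s*([^\]]+)\s*)*\])*")

line = 'text [text1 & text2]'

full = pattern.fullmatch(line)   # must consume the whole line
loose = pattern.search(line)     # may stop as soon as the pattern is satisfied

print(full.groups())    # ('text', 'text1', 'text2')
print(loose.groups())   # ('text ', None, None)
```

With search, group 1 greedily takes 'text ' including the space, the mandatory \s then has nothing left to match, and the optional bracket part is skipped, so the match ends before the '['. With fullmatch, the engine is forced to backtrack until the whole line is consumed.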
CodePudding user response:
I'm not familiar with Python, but I don't think a single group in a regular expression can return more than one value, so my guess is that you'd have to use something like this:
(\w+)(?: \[(.*)\])?
Which will give you one or two groups:
[('text'), ('text1 & text2')]
[('text'), ('text1')]
[('text'), ('text2')]
[('text')]
And then if the second group exists, use a second regular expression to get each individual value from it.
You can have a look here: https://regex101.com/r/DxF5p1/1
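In Python, the two-step idea described above could look roughly like the sketch below. It is not the exact pattern from this answer: it assumes the leading text may contain several words, so it uses [^\[\]]+? instead of \w+, and the names `OUTER` and `extract` are made up for illustration.

```python
import re

# First pass: leading text, then an optional bracketed part.
# Assumption: the leading text may span several words, hence [^\[\]]+? and not \w+.
OUTER = re.compile(r"([^\[\]]+?)(?:\s*\[(.*)\])?")

def extract(line):
    """Return (text, [items]) or None if the line does not fit the format."""
    m = OUTER.fullmatch(line)
    if m is None:
        return None
    text, inner = m[1].strip(), m[2]
    if inner is None:
        return text, []
    # Second pass: split the bracket contents on '&', ignoring surrounding spaces.
    items = [p for p in re.split(r"\s*&\s*", inner.strip()) if p]
    return text, items

for line in ['text [text1 & text2]', 'text [text1]', 'text']:
    print(line, '->', extract(line))
```

Splitting on the separator in a second pass also copes with more than two items inside the brackets, which a fixed number of capture groups would not.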