I am trying to get better with regex in Python, and I am trying to figure out how to isolate a specific substring in some text. The text could look like any of the following:
possible_strings = [
"some text (and words) more text (and words)",
"textrighthere(with some more)",
"little trickier (this time) with (all of (the)(values))"
]
I don't know in advance what each string contains, but I know it always ends with some information in parentheses. This includes cases like #3, where the final pair of parentheses contains nested parentheses.
How could I use re/regex to isolate only the text inside the last pair of parentheses? For the examples above, I would want the output to be:
output = [
"and words",
"with some more",
"all of (the)(values)"
]
Any tips or help would be much appreciated!
CodePudding user response:
In Python you can use the third-party regex module, as it supports recursion:
import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)
['(and words)', '(with some more)', '(all of (the)(values))']
The regex might be quite complicated for a beginner.
A bit of explanation:
(        # opens the 1st capturing group
\(       # matches the character (
(?:      # opens a non-capturing group
[^()]    # 1st alternative: matches a single character that is neither ( nor )
|        # or
(?1)     # 2nd alternative: recursively matches the expression defined in the 1st capturing group
)        # closes the non-capturing group
*        # repeats the non-capturing group zero or more times
\)       # matches the character )
)        # closes the 1st capturing group
$        # asserts position at the end of a line
CodePudding user response:
For the first two, start by matching an opening bracket, which could be either of these:
"some text (and words) more text (and words)"
           ^                     ^
followed by anything that isn't an opening bracket:
"some text (and words) more text (and words)"
           ^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
           |- starting at the first (, this hits
              another (, which isn't allowed.
followed by a closing bracket and the end of the line. Only the last () fits "no more ( until end of line".
>>> import re
>>> re.findall(r'\([^(]*\)$', "some text (and words) more text (and words)")
['(and words)']
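If only the text inside the parentheses is wanted, one option is to wrap the inner part of that pattern in a capture group. The following is a minimal sketch; the names LAST_FLAT_GROUP and inner_text are made up for illustration. It also shows what goes wrong when the last group contains nested parentheses.

```python
import re

# Assumed variant of the pattern above: the extra group captures only the
# text between the final "(" and the closing ")" at the end of the string.
LAST_FLAT_GROUP = re.compile(r'\(([^(]*)\)$')

def inner_text(s):
    """Return the contents of the last (...) group, assuming no nesting."""
    m = LAST_FLAT_GROUP.search(s)
    return m.group(1) if m else None

flat_results = [
    inner_text("some text (and words) more text (and words)"),
    inner_text("textrighthere(with some more)"),
]
print(flat_results)  # ['and words', 'with some more']

# With nested parentheses the pattern latches onto the innermost last "(",
# so the result is not the full outer group.
nested_result = inner_text("little trickier (this time) with (all of (the)(values))")
print(nested_result)  # values)
```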
Regex is not a good fit for your third example: the built-in re module has no easy way to pair up nested parentheses, so you may have to install and use a different regex engine to get nested structure support.
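An alternative that avoids regex entirely is to scan the string backwards from the final closing parenthesis and count nesting depth until the matching opener is found. This is a sketch of that approach, not something from the answers above; the function name last_parenthesized is hypothetical, and it assumes the string ends with ")" (ignoring trailing whitespace).

```python
def last_parenthesized(text):
    """Return the text inside the final balanced (...) group, or None."""
    text = text.rstrip()
    if not text.endswith(")"):
        return None
    depth = 0
    # Walk backwards from the final ")", tracking how deeply nested we are.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                # This "(" pairs with the final ")"; drop both delimiters.
                return text[i + 1:-1]
    return None  # parentheses are unbalanced

possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))",
]
output = [last_parenthesized(s) for s in possible_strings]
print(output)  # ['and words', 'with some more', 'all of (the)(values)']
```

Because it only counts parentheses, this handles any nesting depth with the standard library alone.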