Python ReGex Pattern Finder

Time:06-07

I am trying to get better with ReGex in Python, and I am trying to figure out how I would isolate a specific substring in some text. I have some text, and that text could look like any of the following:

possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))"
]

I don't know in advance what's in each string, but I do know it always ends with some information in parentheses. This includes cases like #3, where the final pair of parentheses contains nested parentheses.

How could I go about using re/ReGex to isolate the text only inside of the last pair of parentheses? So in the previous example, I would want the output to be:

output = [
    "and words",
    "with some more",
    "all of (the)(values)"
]

Any tips or help would be much appreciated!

CodePudding user response:

In Python you can use the third-party regex module (not the built-in re), as it supports recursion:

import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)

['(and words)', '(with some more)', '(all of (the)(values))']
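
Note that these matches still include the outer parentheses, while the question asks for the text inside them. A minimal sketch of one way to strip them, assuming the list is exactly the output shown above (it is hardcoded here so the snippet runs without installing regex; the names matches and inner are only illustrative):

```python
# Output of the regex.findall call, hardcoded here for illustration.
matches = ['(and words)', '(with some more)', '(all of (the)(values))']

# Every match starts with "(" and ends with ")", so drop one character per side.
inner = [m[1:-1] for m in matches]
print(inner)
```

Slicing is enough here because the pattern guarantees each match begins and ends with a parenthesis.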

The regex might be quite complicated for a beginner.

A bit of explanation:

( # 1st capturing group
   \( # matches the character (
      (?: # non-capturing group
         [^()] # 1st alternative: any single character other than ( or )
         | # or
         (?1) # 2nd alternative: recursively matches the expression defined in the 1st capturing group
      ) # closes the non-capturing group
      * # repeats the non-capturing group zero or more times
   \) # matches the character )
) # closes the 1st capturing group
$ # asserts position at the end of a line (because of the regex.M flag)

CodePudding user response:

For the first two, start by matching an opening bracket, which could be either of these:

"some text (and words) more text (and words)"
           ^                     ^

followed by anything which isn't an opening bracket:

"some text (and words) more text (and words)"
           ^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
                                 |- starting at the first ( hit 
                                    another ( which isn't allowed.

followed by end of line. Only the last () fits "no more ( until end of line".

>>> import re
>>> re.findall(r'\([^(]+\)$', "some text (and words) more text (and words)")
['(and words)']
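
To get only the text inside the parentheses, as the question asks, the same "no more ( until end of line" idea can be combined with a capturing group. This is a rough sketch using the built-in re module; the helper name last_flat_parens is hypothetical, and it deliberately handles only the non-nested case:

```python
import re

def last_flat_parens(text):
    """Return the text inside a trailing, non-nested (...) group, or None."""
    # \( ... \)$ anchors the group to the end of the string;
    # [^()]* forbids any parentheses inside the captured text.
    match = re.search(r'\(([^()]*)\)$', text)
    return match.group(1) if match else None

print(last_flat_parens("some text (and words) more text (and words)"))
print(last_flat_parens("textrighthere(with some more)"))
print(last_flat_parens("little trickier (this time) with (all of (the)(values))"))
```

The first two calls return the inner text; the third returns None, because the final group contains nested parentheses that this pattern cannot pair up.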

Regex is not a good fit for your third example: Python's built-in re module has no easy way to pair up the parentheses, so you may have to install and use a different regex engine to get support for nested structures.
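
If installing another engine is not an option, one alternative is to skip regex for the nested case and scan backwards from the final ) while counting nesting depth. This is only a sketch: the function name last_paren_group is hypothetical, and it assumes each string ends with a closing parenthesis whose partner exists.

```python
def last_paren_group(text):
    """Return the contents of the final balanced (...) group at the end of text."""
    text = text.rstrip()
    if not text.endswith(")"):
        return None
    depth = 0
    # Walk backwards from the last character, tracking nesting depth.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                # This "(" pairs with the final ")".
                return text[i + 1:-1]
    return None  # parentheses are unbalanced

possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))",
]
output = [last_paren_group(s) for s in possible_strings]
print(output)
```

Because the scan starts at the end and stops at the first point where the depth returns to zero, it handles all three example strings, including the nested one.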
