Home > Software design >  Python Regex: Select everything between two brackets ignore brackets in between
Python Regex: Select everything between two brackets ignore brackets in between

Time:10-20

I want to capture the entire sub-query irrespective of whether there's a concat or substring function in between (i.e. ignore another bracket opening and closing within sub-query. (a) We don't want to capture "join" as a word (b) "alias2" will not always be followed by "join" it can be anything (a word boundary, space, or "join" word).

Case 1: No concat or sub-string function in select

  • In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2 join

  • Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2

Case 2: Concat function in select

  • In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join

  • Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2

What I have tried:

Approach 1: re.findall('\(select.*?\)\s[a-zA-Z0-9_] ', input statement)

Approach 2: As suggested by @TheFourthBird

import re
pat1 = '\(select.*?\)\s[a-zA-Z0-9_] '
pat2 = "\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)[^()\n] "

string1 = "(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"
string2 = "(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"

print(re.findall(pat1, string1))
print(re.findall(pat1, string2))

import regex as re
print(re.findall(pat2, string1))
print(re.findall(pat2, string2))

pattern = re.compile(pat2, re.UNICODE)
print([match.group(0) for match in pattern.finditer(string2)])

Output:

["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
['(select concat(t1.col1, t2.col1, t3.col1) as']
['']
['(t1.col1, t2.col1, t3.col1)']
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]

What's wrong with the above approach:

  • Approach 1: Works very well for case 1 but doesn't for case 2.

  • Approach 2: Still doesn't work! However, ["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "] is the most closet to what's expected. However, it shouldn't capture what's next to alias2.

Please help me!

CodePudding user response:

You could first match (select and optionally match balanced parenthesis using the PyPi regex module.

At the end of the pattern, match a whitespace char and using your character class.

\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_] 

In parts, the pattern matches:

  • \(select Match (select
  • [^()]* optionally match any char except ( and )
  • (?: Non capture group
    • ( Capture group 1
      • \((?>[^()] |(?1))*\) Match ( and use a recursive pattern recursing the first subgroup (capture group 1) and finally match the )
    • ) Close group 1
  • )? Close non capture group and make it optional
  • [^()]*\) Optionally match any char except parenthesis, then match )
  • \s[a-zA-Z0-9_] Match a whitespace char and 1 of the listed in the character class

See a regex demo and a Python demo

For example, using re.finditer (as re.findall returns the capture group values):

import regex as re

pattern = r"\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_] "

s = ("In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n")

matches = re.finditer(pattern, s)

for matchNum, match in enumerate(matches, start=1):
    
    print(match.group())

Output

["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"]

Note that matching SQL is error prone.

  • Related