Python Regex: Select everything between two brackets ignore brackets in between-CodePudding

I want to capture the entire sub-query irrespective of whether there's a concat or substring function in between (i.e. ignore another bracket opening and closing within sub-query. (a) We don't want to capture "join" as a word (b) "alias2" will not always be followed by "join" it can be anything (a word boundary, space, or "join" word).

Case 1: No concat or sub-string function in select

In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2 join
Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2

Case 2: Concat function in select

In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join
Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2

What I have tried:

Approach 1: re.findall('\(select.*?\)\s[a-zA-Z0-9_] ', input statement)

Approach 2: As suggested by @TheFourthBird

import re
pat1 = '\(select.*?\)\s[a-zA-Z0-9_] '
pat2 = "\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)[^()\n] "

string1 = "(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"
string2 = "(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"

print(re.findall(pat1, string1))
print(re.findall(pat1, string2))

import regex as re
print(re.findall(pat2, string1))
print(re.findall(pat2, string2))

pattern = re.compile(pat2, re.UNICODE)
print([match.group(0) for match in pattern.finditer(string2)])

Output:

["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
['(select concat(t1.col1, t2.col1, t3.col1) as']
['']
['(t1.col1, t2.col1, t3.col1)']
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]

What's wrong with the above approach:

Approach 1: Works very well for case 1 but doesn't for case 2.
Approach 2: Still doesn't work! However, ["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "] is the most closet to what's expected. However, it shouldn't capture what's next to alias2.

Please help me!

CodePudding user response：

You could first match (select and optionally match balanced parenthesis using the PyPi regex module.

At the end of the pattern, match a whitespace char and using your character class.

\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_]

In parts, the pattern matches:

\(select Match (select
[^()]* optionally match any char except ( and )
(?: Non capture group
- ( Capture group 1
  - \((?>[^()] |(?1))*\) Match ( and use a recursive pattern recursing the first subgroup (capture group 1) and finally match the )
- ) Close group 1
)? Close non capture group and make it optional
[^()]*\) Optionally match any char except parenthesis, then match )
\s[a-zA-Z0-9_] Match a whitespace char and 1 of the listed in the character class

See a regex demo and a Python demo

For example, using re.finditer (as re.findall returns the capture group values):

import regex as re

pattern = r"\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_] "

s = ("In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n\n"
    "Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n")

matches = re.finditer(pattern, s)

for matchNum, match in enumerate(matches, start=1):
    
    print(match.group())

Output

["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"]

Note that matching SQL is error prone.