I want to capture the entire sub-query irrespective of whether there's a concat or substring function in between (i.e. ignore another bracket opening and closing within sub-query. (a) We don't want to capture "join" as a word (b) "alias2" will not always be followed by "join" it can be anything (a word boundary, space, or "join" word).
Case 1: No concat or sub-string function in select
In:
(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2 join
Out:
(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2
Case 2: Concat function in select
In:
(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join
Out:
(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2
What I have tried:
Approach 1: re.findall('\(select.*?\)\s[a-zA-Z0-9_] ', input statement)
Approach 2: As suggested by @TheFourthBird
import re
pat1 = '\(select.*?\)\s[a-zA-Z0-9_] '
pat2 = "\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)[^()\n] "
string1 = "(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"
string2 = "(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"
print(re.findall(pat1, string1))
print(re.findall(pat1, string2))
import regex as re
print(re.findall(pat2, string1))
print(re.findall(pat2, string2))
pattern = re.compile(pat2, re.UNICODE)
print([match.group(0) for match in pattern.finditer(string2)])
Output:
["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
['(select concat(t1.col1, t2.col1, t3.col1) as']
['']
['(t1.col1, t2.col1, t3.col1)']
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]
What's wrong with the above approach:
Approach 1: Works very well for case 1 but doesn't for case 2.
Approach 2: Still doesn't work! However,
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]
is the most closet to what's expected. However, it shouldn't capture what's next to alias2.
Please help me!
CodePudding user response:
You could first match (select
and optionally match balanced parenthesis using the PyPi regex module.
At the end of the pattern, match a whitespace char and using your character class.
\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_]
In parts, the pattern matches:
\(select
Match(select
[^()]*
optionally match any char except(
and)
(?:
Non capture group(
Capture group 1\((?>[^()] |(?1))*\)
Match(
and use a recursive pattern recursing the first subgroup (capture group 1) and finally match the)
)
Close group 1
)?
Close non capture group and make it optional[^()]*\)
Optionally match any char except parenthesis, then match)
\s[a-zA-Z0-9_]
Match a whitespace char and 1 of the listed in the character class
See a regex demo and a Python demo
For example, using re.finditer (as re.findall returns the capture group values):
import regex as re
pattern = r"\(select [^()]*(?:(\((?>[^()] |(?1))*\)))?[^()]*\)\s[a-zA-Z0-9_] "
s = ("In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
"Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2\n\n"
"In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n\n"
"Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2\n")
matches = re.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
print(match.group())
Output
["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"]
Note that matching SQL is error prone.