I am working on a SQL analysis tool that, given a RAW SQL SELECT query, can give some sort of analysis. The first version of the tool is finished and can analyze simple RAW queries. However, when the query contains a subquery it breaks.
So I am looking for a simple but reliable way to parse queries and subqueries. My tool must analyze every subquery individually so for example:
Suppose this is the query that the tool is given as input:
SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
Then I would like to get a list of queries like so:
queries = [
"SELECT name, EMAIL FROM <subquery> WHERE id in <subquery>"
"SELECT * FROM user WHERE email IS NOT NULL"
"SELECT cID FROM customer WHERE points > 5)"
]
In my first attempt, I am using the fact that subqueries are always written between brackets. So I scan the initial query for brackets. This works when subqueries aren't nested i.e. the are no subqueries inside subqueries. I also experimented a bit with AST, but felt that it was probably a bit too complicated and that there are probably more simple ways.
Anyone who's able to guide me in the right direction? I am using Python, but examples in other languages are also much appreciated.
CodePudding user response:
You can use sqlparse
:
import sqlparse
def queries(d):
if type(d) != sqlparse.sql.Token:
paren = isinstance(d, sqlparse.sql.Parenthesis)
v = [queries(i) for i in (d if not paren else d[1:-1])]
subseq, qrs = ''.join(str(i[0]) for i in v), [x for _, y in v for x in y]
if [*d][paren].value == 'SELECT':
return '<subquery>', [subseq] qrs
return subseq, qrs
return d, []
s="""SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
"""
_, subqueries = queries(sqlparse.parse(s)[0])
Output:
['SELECT name, email\n FROM <subquery>\n WHERE id IN <subquery>\n', 'SELECT * FROM user WHERE email IS NOT NULL', 'SELECT cID FROM customer WHERE points > 5']
Using the sqlparse
library, you can parse a SQL input string into a tokenized stream of keywords, statements, and values. The function queries
above takes in a sqlparse.sql.Statement
object and searches for any occurrence of a SELECT
statement in the query, reformatting the original input along the way to remove subqueries, per the desired output sample.