Home > front end >  Simple way to parse SQL subqueries
Simple way to parse SQL subqueries

Time:05-06

I am working on a SQL analysis tool that, given a RAW SQL SELECT query, can give some sort of analysis. The first version of the tool is finished and can analyze simple RAW queries. However, when the query contains a subquery it breaks.

So I am looking for a simple but reliable way to parse queries and subqueries. My tool must analyze every subquery individually so for example:

Suppose this is the query that the tool is given as input:

SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)

Then I would like to get a list of queries like so:

queries = [
    "SELECT name, EMAIL FROM <subquery> WHERE id in <subquery>"
    "SELECT * FROM user WHERE email IS NOT NULL"
    "SELECT cID FROM customer WHERE points > 5)"
]

In my first attempt, I am using the fact that subqueries are always written between brackets. So I scan the initial query for brackets. This works when subqueries aren't nested i.e. the are no subqueries inside subqueries. I also experimented a bit with AST, but felt that it was probably a bit too complicated and that there are probably more simple ways.

Anyone who's able to guide me in the right direction? I am using Python, but examples in other languages are also much appreciated.

CodePudding user response:

You can use sqlparse:

import sqlparse
def queries(d):
  if type(d) != sqlparse.sql.Token:
     paren = isinstance(d, sqlparse.sql.Parenthesis)
     v = [queries(i) for i in (d if not paren else d[1:-1])]
     subseq, qrs = ''.join(str(i[0]) for i in v), [x for _, y in v for x in y]
     if [*d][paren].value == 'SELECT':
        return '<subquery>', [subseq] qrs
     return subseq, qrs
  return d, []

s="""SELECT name, email
     FROM (SELECT * FROM user WHERE email IS NOT NULL)
     WHERE id IN (SELECT cID FROM customer WHERE points > 5)
"""
_, subqueries = queries(sqlparse.parse(s)[0])

Output:

['SELECT name, email\n     FROM <subquery>\n     WHERE id IN <subquery>\n', 'SELECT * FROM user WHERE email IS NOT NULL', 'SELECT cID FROM customer WHERE points > 5']

Using the sqlparse library, you can parse a SQL input string into a tokenized stream of keywords, statements, and values. The function queries above takes in a sqlparse.sql.Statement object and searches for any occurrence of a SELECT statement in the query, reformatting the original input along the way to remove subqueries, per the desired output sample.

  • Related