Home > Software engineering >  Regular expression to match a create table statement
Regular expression to match a create table statement

Time:12-21

I have the following create table statement:

CREATE TABLE db.my_table (
 ts TIMESTAMP OPTIONS (description = "This is 'a' (timestamp)"),
 id INT64 OPTIONS (description = 'id'),
 value STRING
) PARTITION BY ts, id
  CLUSTER BY id
  OPTIONS (description = 'test_(table)');

I want to extract several fields: db, table name, column definitions i.e. everything between (, after db.my_table, and ) before PARTITION BY. Then I want to extract the fields of PARTITION BY, CLUSTER BY and the text of the final description. The PARTITION BY, CLUSTER BY and OPTIONS expressions are optional, that is, any of them can appear or can be missing.

I used the following regex:

CREATE TABLE (\w ).(\w ) \(((?s).*)\)(?:\s PARTITION BY ([^\n\r] ))?(?:\s CLUSTER BY ([^\n\r] ))?(?:\s OPTIONS \(description = ((?<Qt>'''|'|")(.*?)\k<Qt>))?;

Demo regex

This works fine when the last OPTIONS expression is not defined, but when it is, it captures everything until the last ) of OPTIONS. For example, this input is matched correctly:

CREATE TABLE db.my_table (
 ts TIMESTAMP OPTIONS (description = "This is 'a' (timestamp)"),
 id INT64 OPTIONS (description = 'id'),
 value STRING
) PARTITION BY ts, id
  CLUSTER BY id;

How can I fix it, so to match OPTIONS separately?

CodePudding user response:

First of all, regex is not the right tool for parsing SQL. This should be done with a proper SQL parser.

If however, the variation in the input string is limited (as you state in comments), then there are two things you could improve in your regex:

  1. There is a missing \) in your pattern to match what follows in the OPTIONS clause.

  2. To avoid that the third capture group reads beyond the closing bracket of the column definition part, you could take more control over nested brackets and string literals by replacing ((?s).*) with ((?:\((?:"[^"]*"|[^)])*\)|[^)])*)

    This will match (one level of) nested parentheses, and deal with quoted strings inside of them (double quoted), and will stop at any other closing parenthesis.

So:

CREATE TABLE (\w ).(\w ) \(((?:\((?:"[^"]*"|[^)])*\)|[^)])*)\)(?:\s PARTITION BY ([^\n\r] ))?(?:\s CLUSTER BY ([^\n\r] ))?(?:\s OPTIONS \(description = ((?<Qt>'''|'|")(.*?)\k<Qt>)\))?;
  • Related