What are the gotchas when converting a T-SQL statement into a JavaScript RegExp

Time:12-09

I have a large number of T-SQL statements logged from a server I manage. I'm trying to boil them down to one instance of each.

Here's one of them:

SELECT TBLLANGUAGE.NAME AS NAME1, TBLLANGUAGE_1.NAME AS NAME2, 
TBLLANGUAGELANGUAGE.LNGFKCHILD, TBLLANGUAGELANGUAGE.LNGFKPARENT, 
TBLLANGUAGELANGUAGE.STYLE, TBLLANGUAGELANGUAGE.EXTENT, 
TBLLANGUAGELANGUAGE.NATURE, TBLSOURCE.TXTTITLE, TBLSOURCE_1.TXTTITLE AS 
SURTITLE FROM ((((TBLLANGUAGE LEFT JOIN TBLLANGUAGELANGUAGE ON 
TBLLANGUAGE.ID = TBLLANGUAGELANGUAGE.LNGFKPARENT) LEFT JOIN TBLLANGUAGE 
AS TBLLANGUAGE_1 ON TBLLANGUAGELANGUAGE.LNGFKCHILD = TBLLANGUAGE_1.ID) 
LEFT JOIN TBLLANGLANGSOURCE ON TBLLANGUAGELANGUAGE.IDLANGLINK = 
TBLLANGLANGSOURCE.LNGFKLANGLINK) LEFT JOIN TBLSOURCE ON 
TBLLANGLANGSOURCE.LNGFKSOURCE = TBLSOURCE.IDSOURCE) LEFT JOIN TBLSOURCE 
AS TBLSOURCE_1 ON TBLSOURCE.LNGPARTOF = TBLSOURCE_1.IDSOURCE WHERE 
(((TBLLANGUAGELANGUAGE.LNGFKPARENT) = 8687)) OR 
(((TBLLANGUAGELANGUAGE.LNGFKCHILD) = 8687)) ORDER BY 
IIF(TBLLANGUAGELANGUAGE.LNGFKPARENT = 8687,'B','A'), TBLLANGUAGE.NAME, 
TBLLANGUAGE_1.NAME;

I want to convert that to a JavaScript RegExp, replacing each run of digits with \d+ and anything between apostrophes with '.*'.

Here's how far I've got with Deno:

function getPattern(text: string): string {
  text = text.replace(/\(/g, "\\x28")
    .replace(/\)/g, "\\x29")
    .replace(/\$/g, "\\x24")
    .replace(/\^/g, "\\x5e")
    .replace(/\./g, "\\x2e")
    .replace(/\*/g, "\\x2a")
    .replace(/\[/g, "\\x5b")
    .replace(/\]/g, "\\x5d")
    .replace(/\?/g, "\\x3f");

  [ "\\<\s\\>", "\\<", "\\<=", "=", "\\>=", "\\>"].forEach((op) => {
    const numberPattern = new RegExp(`\\s${op}\\s(\\d+)`, "g");
    text.match(numberPattern)?.forEach((e) => {
      text = text.replace(e, ` ${op} \\d+`);
    });
  });

  //const textPattern = /'[^']*'\s/g;
  const textPattern = /\s*'.*'\s*/g;
  text.match(textPattern)?.forEach((e) => {
    //const eLength = e.length;
    text = text.replace(e, "\\s*'.*'\\s*");
  });

  return text; //.replace(/\</g, "\\x3c")
    //.replace(/\>/g, "\\x3e");
}

This renders the above statement as

SELECT TBLLANGUAGE\x2eNAME AS NAME1, TBLLANGUAGE_1\x2eNAME AS NAME2, 
TBLLANGUAGELANGUAGE\x2eLNGFKCHILD, TBLLANGUAGELANGUAGE\x2eLNGFKPARENT, 
TBLLANGUAGELANGUAGE\x2eSTYLE, TBLLANGUAGELANGUAGE\x2eEXTENT, 
TBLLANGUAGELANGUAGE\x2eNATURE, TBLSOURCE\x2eTXTTITLE, 
TBLSOURCE_1\x2eTXTTITLE AS SURTITLE FROM \x28\x28\x28\x28TBLLANGUAGE 
LEFT JOIN TBLLANGUAGELANGUAGE ON TBLLANGUAGE\x2eID = 
TBLLANGUAGELANGUAGE\x2eLNGFKPARENT\x29 LEFT JOIN TBLLANGUAGE AS 
TBLLANGUAGE_1 ON TBLLANGUAGELANGUAGE\x2eLNGFKCHILD = 
TBLLANGUAGE_1\x2eID\x29 LEFT JOIN TBLLANGLANGSOURCE ON 
TBLLANGUAGELANGUAGE\x2eIDLANGLINK = 
TBLLANGLANGSOURCE\x2eLNGFKLANGLINK\x29 LEFT JOIN TBLSOURCE ON 
TBLLANGLANGSOURCE\x2eLNGFKSOURCE = TBLSOURCE\x2eIDSOURCE\x29 LEFT JOIN 
TBLSOURCE AS TBLSOURCE_1 ON TBLSOURCE\x2eLNGPARTOF = 
TBLSOURCE_1\x2eIDSOURCE WHERE 
\x28\x28\x28TBLLANGUAGELANGUAGE\x2eLNGFKPARENT\x29 = \d+\x29\x29 OR 
\x28\x28\x28TBLLANGUAGELANGUAGE\x2eLNGFKCHILD\x29 = \d+\x29\x29 ORDER 
BY IIF\x28TBLLANGUAGELANGUAGE\x2eLNGFKPARENT = \d+,\s*'.*'\s*\x29, 
TBLLANGUAGE\x2eNAME, TBLLANGUAGE_1\x2eNAME;

I'm converting various characters to their \xnn forms because, as I read the documentation, new RegExp() treats an embedded ( as the start of a group rather than as a literal character. That is, it doesn't seem to be sufficient simply to say

const pattern = new RegExp("SELECT TBLLANGUAGE.NAME (etcetera)","gi");

Am I reading the docs wrong and is there a better way? And no, I don't want to write a T-SQL parser unless there's a really, really good reason.
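
A minimal sketch of the behaviour in question; the sample string and variable names are made up for illustration. It suggests the reading of the docs is right (an unescaped parenthesis does open a group), and that a plain backslash escape behaves the same as the \xnn form:

```typescript
// The same target tested against three spellings of the pattern.
const target = "IIF(X = 1)";

// Unescaped: "(" opens a capture group, so this pattern looks for
// the text "IIFX = 1" and does not match the literal parentheses.
const unescaped = new RegExp("IIF(X = 1)").test(target);

// Backslash-escaped: "\\(" in a string literal becomes \( in the pattern.
const backslashed = new RegExp("IIF\\(X = 1\\)").test(target);

// Hex-escaped, as getPattern does: \x28 is "(" and \x29 is ")".
const hexEscaped = new RegExp("IIF\\x28X = 1\\x29").test(target);
```

Here unescaped is false while backslashed and hexEscaped are both true, so the two escaping styles are interchangeable for matching purposes.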

SOME TIME LATER

I've essentially solved my problem by using a different pattern-matching approach. Please see "Extracting example SQL statements from a log" on DEV.

CodePudding user response:

I don't fully understand what you're trying to achieve but if it's:

convert this SQL statement into a valid regex which can find other SQL like it

then this would do it:

var sql = `SELECT TBLLANGUAGE.NAME AS NAME1, TBLLANGUAGE_1.NAME AS NAME2, 
TBLLANGUAGELANGUAGE.LNGFKCHILD, TBLLANGUAGELANGUAGE.LNGFKPARENT, 
TBLLANGUAGELANGUAGE.STYLE, TBLLANGUAGELANGUAGE.EXTENT, 
TBLLANGUAGELANGUAGE.NATURE, TBLSOURCE.TXTTITLE, TBLSOURCE_1.TXTTITLE AS 
SURTITLE FROM ((((TBLLANGUAGE LEFT JOIN TBLLANGUAGELANGUAGE ON 
TBLLANGUAGE.ID = TBLLANGUAGELANGUAGE.LNGFKPARENT) LEFT JOIN TBLLANGUAGE 
AS TBLLANGUAGE_1 ON TBLLANGUAGELANGUAGE.LNGFKCHILD = TBLLANGUAGE_1.ID) 
LEFT JOIN TBLLANGLANGSOURCE ON TBLLANGUAGELANGUAGE.IDLANGLINK = 
TBLLANGLANGSOURCE.LNGFKLANGLINK) LEFT JOIN TBLSOURCE ON 
TBLLANGLANGSOURCE.LNGFKSOURCE = TBLSOURCE.IDSOURCE) LEFT JOIN TBLSOURCE 
AS TBLSOURCE_1 ON TBLSOURCE.LNGPARTOF = TBLSOURCE_1.IDSOURCE WHERE 
(((TBLLANGUAGELANGUAGE.LNGFKPARENT) = 8687)) OR 
(((TBLLANGUAGELANGUAGE.LNGFKCHILD) = 8687)) ORDER BY 
IIF(TBLLANGUAGELANGUAGE.LNGFKPARENT = 8687,'B','A'), TBLLANGUAGE.NAME, 
TBLLANGUAGE_1.NAME;`;

// First replace: account for JS regex special chars and escape with backslash to make them literal
// Second replace: get everything between single quotes and make it .+?
// Third replace: get all digit sequences and make them \d+
// Fourth replace: get all whitespace sequences and make them \s+
var sql_regex = sql.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' )
                   .replace( /('.+?')/g, '\'.+?\'' )
                   .replace( /\d+/g, '\\d+' )
                   .replace( /\s+/g, '\\s+' );

console.log( sql_regex );

// Test if our regex matches the string it was built from
console.log( new RegExp( sql_regex, 'g' ).test( sql ) );

Value of sql_regex:

SELECT\s+TBLLANGUAGE\.NAME\s+AS\s+NAME\d+,\s+TBLLANGUAGE_\d+\.NAME
\s+AS\s+NAME\d+,\s+TBLLANGUAGELANGUAGE\.LNGFKCHILD,
\s+TBLLANGUAGELANGUAGE\.LNGFKPARENT,\s+TBLLANGUAGELANGUAGE\.STYLE,
\s+TBLLANGUAGELANGUAGE\.EXTENT,\s+TBLLANGUAGELANGUAGE\.NATURE,
\s+TBLSOURCE\.TXTTITLE,\s+TBLSOURCE_\d+\.TXTTITLE\s+AS\s+SURTITLE
\s+FROM\s+\(\(\(\(TBLLANGUAGE\s+LEFT\s+JOIN\s+TBLLANGUAGELANGUAGE\s+ON
\s+TBLLANGUAGE\.ID\s+=\s+TBLLANGUAGELANGUAGE\.LNGFKPARENT\)\s+LEFT
\s+JOIN\s+TBLLANGUAGE\s+AS\s+TBLLANGUAGE_\d+\s+ON
\s+TBLLANGUAGELANGUAGE\.LNGFKCHILD\s+=\s+TBLLANGUAGE_\d+\.ID\)\s+LEFT
\s+JOIN\s+TBLLANGLANGSOURCE\s+ON\s+TBLLANGUAGELANGUAGE\.IDLANGLINK\s+=
\s+TBLLANGLANGSOURCE\.LNGFKLANGLINK\)\s+LEFT\s+JOIN\s+TBLSOURCE\s+ON
\s+TBLLANGLANGSOURCE\.LNGFKSOURCE\s+=\s+TBLSOURCE\.IDSOURCE\)\s+LEFT
\s+JOIN\s+TBLSOURCE\s+AS\s+TBLSOURCE_\d+\s+ON\s+TBLSOURCE\.LNGPARTOF
\s+=\s+TBLSOURCE_\d+\.IDSOURCE\s+WHERE
\s+\(\(\(TBLLANGUAGELANGUAGE\.LNGFKPARENT\)\s+=\s+\d+\)\)\s+OR
\s+\(\(\(TBLLANGUAGELANGUAGE\.LNGFKCHILD\)\s+=\s+\d+\)\)\s+ORDER\s+BY
\s+IIF\(TBLLANGUAGELANGUAGE\.LNGFKPARENT\s+=\s+\d+,'.+?','.+?'\),
\s+TBLLANGUAGE\.NAME,\s+TBLLANGUAGE_\d+\.NAME;

Note: the line breaks in the value of sql_regex shown above are cosmetic; they were added only for readability, and the actual string is a single line.
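
If the end goal is to boil a log down to one instance of each statement, the same chain of replaces could be wrapped in a helper and used to skip statements whose shape has already been seen. The following is only a sketch under some assumptions: sqlToPatternSource, uniqueShapes and the sample statements are hypothetical names and data, not part of the original code; quoted literals are matched with '[^']*' instead of '.+?' so that empty strings and several literals on one line are handled separately; and the pattern is anchored with ^ and $ so that it cannot match a fragment of a longer statement.

```typescript
// Hypothetical helper: turn one logged statement into an anchored pattern source.
function sqlToPatternSource(sql: string): string {
  const body = sql.trim()
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/'[^']*'/g, "'[^']*'")          // any single-quoted literal
    .replace(/\d+/g, "\\d+")                 // any run of digits
    .replace(/\s+/g, "\\s+");                // any run of whitespace
  return `^${body}$`;
}

// Hypothetical helper: keep the first statement seen for each distinct shape.
function uniqueShapes(statements: string[]): string[] {
  const seen: RegExp[] = [];
  const kept: string[] = [];
  for (const stmt of statements) {
    const s = stmt.trim();
    // No "g" flag here: test() on a global regex remembers lastIndex
    // between calls, which makes repeated tests unreliable.
    if (seen.some((re) => re.test(s))) continue;
    seen.push(new RegExp(sqlToPatternSource(s), "i"));
    kept.push(s);
  }
  return kept;
}

// Made-up sample data: the first two differ only in spacing and literals.
const sample = [
  "SELECT NAME FROM T WHERE ID = 8687 AND KIND = 'B';",
  "SELECT NAME  FROM T WHERE ID = 12 AND KIND = 'A';",
  "SELECT NAME FROM T WHERE ID = 12 OR KIND = 'A';",
];
const result = uniqueShapes(sample);
```

With this sample, result keeps the first and third statements. As in the code above, digits inside identifiers (for example TBLLANGUAGE_1) are generalised to \d+ too, which may or may not be what you want.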
