REGEX : Extracting Table information from connection strings-CodePudding

I am attempting to extract Schema and Table information from connection string data. The Schema and Table information is in the format "Schema.Table" (e.g FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED in string below) . Multiple Schema and Tables can exist in the connection strings, and they always follow the words FROM or JOIN (Including INNER JOIN, LEFT JOIN etc.). I only want to extract Schema and Tables for Connections from a specific database, in the attached example this is DB2.

"let
    Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number  WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken'  with ur""]), RIGHT JOIN EDWP.TEMP  blaha blah
    #""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime}})
in
    #""Changed Type"""

I have a REGEX expression that is correctly returning the 4 Schemas.Tables in the sample text as expected (link to regex 101 with working example :

(?:\s JOIN\s )(\w \.\w )|(?:\s FROM\s )(\w \.\w )

I want to improve the REGEX so that that I only get Schemas & Tables back for strings that begin with the text "DB2.Database" . How can I do this?

I have attempted adding in the prefix :

(?:DB2.Database)(?:\s|\S)*

but that stops the 4 Schema.Tables from being returned.

Can anyone suggest a fix? If you do provide an answer, an explaination of the REGEX logic would be appreciated.

Once I have the REGEX working I will run it in Python using the re module.

CodePudding user response：

You can try to put ? after the prefix which will cause the expression not to match as much of the text as possible

(?:DB2.Database)(?:\s|\S)*?

Then the following

(?:DB2.Database)(((?:\s|\S)*?(?:\s (JOIN|FROM)\s )(\w \.\w )) )

would almost work but re module doesn't support repeated captures

As a workaround, running the original regex (?:\s (FROM|JOIN)\s )(\w \.\w ) with the first matched group should give you the desired result.

CodePudding user response：

Simple implemention of the solution proposed by @radof

import re
import pandas

text ="""let
    Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number  WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken'  with ur""]), RIGHT JOIN EDWP.TEMP  blaha blah
    #""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime
    #filler 
    filler
    filler
    filler
    }})
in
    #""Changed Type"""

DB2_pattern = re.compile(r'(DB2.Database)(?:\s|\S)*', re.IGNORECASE)
Schema_Table = re.compile(r'(?:\s JOIN\s )(\w \.\w )|(?:\s FROM\s )(\w \.\w )', re.IGNORECASE)

DB2_matches = DB2_pattern.finditer(text)
for constring in DB2_matches:
    db2=constring.group()
    #print(db2)
    matches = Schema_Table.finditer(db2)
    for i in matches:
      tables=i.group()
      print(tables)