I am attempting to extract Schema and Table information from connection string data. The Schema and Table information is in the format "Schema.Table" (e.g FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED in string below) . Multiple Schema and Tables can exist in the connection strings, and they always follow the words FROM or JOIN (Including INNER JOIN, LEFT JOIN etc.). I only want to extract Schema and Tables for Connections from a specific database, in the attached example this is DB2.
"let
Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken' with ur""]), RIGHT JOIN EDWP.TEMP blaha blah
#""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime}})
in
#""Changed Type"""
I have a REGEX expression that is correctly returning the 4 Schemas.Tables in the sample text as expected (link to regex 101 with working example :
(?:\s JOIN\s )(\w \.\w )|(?:\s FROM\s )(\w \.\w )
I want to improve the REGEX so that that I only get Schemas & Tables back for strings that begin with the text "DB2.Database" . How can I do this?
I have attempted adding in the prefix :
(?:DB2.Database)(?:\s|\S)*
but that stops the 4 Schema.Tables from being returned.
Can anyone suggest a fix? If you do provide an answer, an explaination of the REGEX logic would be appreciated.
Once I have the REGEX working I will run it in Python using the re module.
CodePudding user response:
You can try to put ?
after the prefix which will cause the expression not to match as much of the text as possible
(?:DB2.Database)(?:\s|\S)*?
Then the following
(?:DB2.Database)(((?:\s|\S)*?(?:\s (JOIN|FROM)\s )(\w \.\w )) )
would almost work but re
module doesn't support repeated captures
As a workaround, running the original regex (?:\s (FROM|JOIN)\s )(\w \.\w )
with the first matched group should give you the desired result.
CodePudding user response:
Simple implemention of the solution proposed by @radof
import re
import pandas
text ="""let
Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken' with ur""]), RIGHT JOIN EDWP.TEMP blaha blah
#""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime
#filler
filler
filler
filler
}})
in
#""Changed Type"""
DB2_pattern = re.compile(r'(DB2.Database)(?:\s|\S)*', re.IGNORECASE)
Schema_Table = re.compile(r'(?:\s JOIN\s )(\w \.\w )|(?:\s FROM\s )(\w \.\w )', re.IGNORECASE)
DB2_matches = DB2_pattern.finditer(text)
for constring in DB2_matches:
db2=constring.group()
#print(db2)
matches = Schema_Table.finditer(db2)
for i in matches:
tables=i.group()
print(tables)