I am trying to validate strings in an input file, using re to build patterns and parse the file. The input is a CSV file with multiple lines of SQL-like script, and I want to validate the words of each line in order: first check keyword1, then keyword2, then keyword3. I did this with nested for loops, but I feel there must be a better way to handle it. Can anyone suggest how to tackle this?
use case:
CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text
CREATE flow flow_src_vat4_xyz_rcc_clm1_trm_soc some text
CREATE stream main_stream_vat4_xyz_rcc_clm1_trm_soc with some text
CREATE OR REPLACE target comp_tgt_vat4_xyz_rcc_clm1_trm_soc some text
To handle this, I tried the following:
kword=['CREATE','CREATE OR REPLACE']
with open('myinput.txt', 'r') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
for line in nlines:
    for string in line:
        for word in kword:
            if string is word:
                atype=next(string)
                print('type is', atype) # such as application, flow, stream
                if atype in ['application', 'flow', 'stream']:
                    value=next(atype) ## such as vat4_xyz_rcc_clm1_trm_soc, flow_src_vat4_xyz_rcc_clm1_trm_soc
                    print("name", value)
                else:
                    print("name not found")
            else:
                print("type is not correct")
But this code is not efficient. I think re might do a better job here. Does anyone have a better idea?
objective:
Basically, I need to parse each line: if I spot keyword1 such as CREATE, I check the word next to keyword1; if that next word is application, I print it and then check the word after it (for example vat4_xyz_rcc_clm1_trm_soc), for which I composed the following pattern:
pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
m=re.match(pat1, curStr, re.M)
The name on each line follows a different pattern, such as:
pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat2=r'^\flow_src_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat3=r'^\main_stream_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat4=r'^\comp_tgt_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
How can we make parsing each line with re as simple as possible? Any thoughts?
Answer:
The regex seems like it can be way simpler than what you're trying. How about this:
import re
matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")
with open("test.txt", "r") as f:
    for line in f:
        if match := matcher.match(line):
            action, action_type, value = match.groups()
            print(f"{action=}, {action_type=}, {value=}")
Outputs:
action='CREATE', action_type='application', value='vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='flow', value='flow_src_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='stream', value='main_stream_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE OR REPLACE', action_type='target', value='comp_tgt_vat4_xyz_rcc_clm1_trm_soc'
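As a variation, here is a minimal sketch of the same line pattern using named groups, so the three fields can be read by name instead of by position. The group names (`action`, `kind`, `name`) and the helper `parse_line` are my own choices for illustration, and the sample lines are held in a list rather than read from a file so the snippet runs on its own.

```python
import re

# Same idea as the pattern above, with named groups.
# The group names are illustrative, not part of the original question.
LINE_RE = re.compile(
    r"^(?P<action>CREATE(?: OR REPLACE)?) (?P<kind>\S+) (?P<name>\S+)"
)


def parse_line(line):
    """Return a dict with the three fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


sample = [
    "CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text",
    "CREATE OR REPLACE target comp_tgt_vat4_xyz_rcc_clm1_trm_soc some text",
    "DROP stream main_stream_vat4_xyz_rcc_clm1_trm_soc",  # no CREATE keyword
]

for line in sample:
    print(parse_line(line))
```

A line that does not start with CREATE or CREATE OR REPLACE yields None, so the caller can decide how to report it.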
If you want to further validate the values, I would take the results from the first regex and pipe them into a more specialized regex for each case.
import re
line_matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")
value_matchers = {
    "application": re.compile(r'^vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "flow": re.compile(r'^flow_src_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "stream": re.compile(r'^main_stream_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "target": re.compile(r'^comp_tgt_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
}
with open("test.txt", "r") as file:
    for line in file:
        if not (line_match := line_matcher.match(line)):
            print(f"Invalid line: {line=}")
            continue
        action, action_type, value = line_match.groups()
        if not (value_matcher := value_matchers.get(action_type)):
            print(f"Invalid action type: {line=}")
            continue
        if not value_matcher.match(value):
            print(f"Invalid {action_type} value: {line=}")
            continue
        # Do some work on the items
        ...
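Since the four value patterns differ only in their prefix, one possible simplification is to build them from a shared core plus a prefix table. This is only a sketch: the names `CORE`, `PREFIXES`, `VALUE_RES` and `is_valid_name` are hypothetical, and it assumes the whole name must match, so it uses `fullmatch`, which is stricter than the `match` calls above (trailing characters after `soc` or `aws` are rejected).

```python
import re

# Shared core of every name, e.g. vat4_xyz_rcc_clm1_trm_soc.
CORE = r"vat\d+_(?:xyz|abs)_rcc_clm\d+_trm_(?:soc|aws)"

# Prefix expected for each object type, taken from the sample lines.
PREFIXES = {
    "application": "",
    "flow": "flow_src_",
    "stream": "main_stream_",
    "target": "comp_tgt_",
}

# One compiled pattern per type: escaped literal prefix followed by the core.
VALUE_RES = {
    kind: re.compile(re.escape(prefix) + CORE)
    for kind, prefix in PREFIXES.items()
}


def is_valid_name(kind, name):
    """True if `kind` is known and `name` matches its pattern in full."""
    pattern = VALUE_RES.get(kind)
    return bool(pattern and pattern.fullmatch(name))


print(is_valid_name("flow", "flow_src_vat4_xyz_rcc_clm1_trm_soc"))
print(is_valid_name("application", "flow_src_vat4_xyz_rcc_clm1_trm_soc"))
```

Adding a new object type then only needs a new entry in the prefix table, rather than another copy of the full pattern.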