I am trying to parse an input text file, going through it line by line and checking whether each line starts with one of several keywords. The input is SQL-like text: when a line starts with a keyword, I want to check the pattern that follows it. With some help I was able to parse part of the file, but when I added more keywords, some lines were no longer parsed and I got a ValueError instead. I read the re documentation and understand how to check whether a line starts with one of several strings, but I still end up with the ValueError. Could anyone take a look and correct me?
Use case:
I expect each line to start with one of these: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. I do this with @flakes' help from a previous post:
import re, os, sys
matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")
with open('input.txt', 'r') as f:
    lines = f.readlines()
nlines = [v for v in lines if not v.isspace()]
for line in nlines:
    if match := matcher.match(line):
        component, component_type, component_name = match.groups()
        print(f"{component=},{component_type=}, {component_name=}")
But nothing is printed. I checked this pattern at https://regex101.com, but I still end up with this ValueError:
ValueError: not enough values to unpack (expected 3, got 2)
I used this input file for parsing with re.
Objective:
Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO. If it does, I want to print the next word and check the pattern of the word after that. Can anyone point out what is wrong with the pattern I composed?
CodePudding user response:
If you always want to return 3 groups, you can make the last group capture optional non-whitespace characters, since that value is not always present in the example data.
You should use a single capture group for all the variants at the start of the string, and put the optional parts in non-capturing groups.
Note that you should not put the intended groupings in square brackets, as that changes their meaning to a character class matching any one of the characters listed in it.
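To make the character-class point concrete, here is a small sketch comparing the two patterns. It assumes the quantifiers in the question's pattern were \S+, and the sample lines ("app1", "x", "y") are made up for illustration. Inside [...] the parentheses are literal characters, so the question's pattern has only two capture groups, which is why unpacking into three names fails.

```python
import re

# Pattern from the question: everything between [ and ] is ONE character class,
# so the parentheses inside it are literal characters, not groups.
broken = re.compile(
    r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$"
)

# Pattern with a single capture group holding all keyword alternatives.
fixed = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")

# Number of capture groups in each compiled pattern.
print(broken.groups)  # 2
print(fixed.groups)   # 3

# The broken pattern wants exactly one character from the class, then a space,
# so a real keyword line does not match at all.
print(broken.match("CREATE application app1"))  # None

# A line that does match yields only two values.
print(broken.match("E x y").groups())  # ('x', 'y')

print(fixed.match("CREATE application app1").groups())
```

Unpacking the two-element result into component, component_type, component_name is what raises "not enough values to unpack (expected 3, got 2)".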
The updated pattern:
^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)
See the capture groups in the regex demo
import re
# One capture group for the keyword, one for the next word, and an optional third word
matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")
with open('input.txt', 'r') as f:
    lines = f.readlines()
# Skip blank lines
nlines = [v for v in lines if not v.isspace()]
for line in nlines:
    if match := matcher.match(line):
        component, component_type, component_name = match.groups()
        print(f"{component=},{component_type=}, {component_name=}")
Output
component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
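The keyword list in the question also includes a bare INPUT, which the pattern above does not match. If that form can actually occur in the file, one possible variation is to make " FROM" optional. This is only a sketch: parse_line is a hypothetical helper, and the sample lines are made up rather than taken from the real input file.

```python
import re

# Same pattern as above, except INPUT may optionally be followed by " FROM".
matcher = re.compile(
    r"^(CREATE(?: OR REPLACE)?|END|INPUT(?: FROM)?|INSERT INTO) (\S+) ?(\S*)"
)


def parse_line(line):
    """Return (component, component_type, component_name), or None if no keyword matches."""
    match = matcher.match(line)
    return match.groups() if match else None


# Made-up sample lines, for illustration only.
samples = [
    "CREATE OR REPLACE SOURCE src1",
    "INPUT stream1",
    "INPUT FROM stream1;",
    "END FLOW flow1;",
    "SELECT * FROM x",
]

for sample in samples:
    print(parse_line(sample))
```

Because the third group is (\S*), lines with only one word after the keyword still return three values, with an empty string in the last position.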