How to parse lines that start with one of multiple keywords in Python?

I am trying to parse an input text file and check whether each line starts with one of several possible keywords, adding keywords incrementally. The input is SQL-like text: when a line starts with one of the keywords, I want to check the pattern of the words that follow it. I got some help and was able to parse part of the file, but when I added more keywords, some lines were no longer parsed and I got a ValueError instead. I read the re documentation on matching a line that starts with one of several strings, but I still end up with a ValueError. Can anyone take a look and correct me?

Use case:

I expect each line to start with one of these: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. Based on @flakes' help in a previous post, I tried this:

import re, os, sys
    
matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")

with open('input.txt', 'r') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")

But nothing is printed out. I checked this pattern at https://regex101.com, but I still end up with this ValueError:

ValueError: not enough values to unpack (expected 3, got 2)

This is the input file I am parsing with re.

Objective:

Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO. If it does, I want to print the next word and check the pattern of the word after that. Can anyone point out what went wrong with the pattern I composed?

CodePudding user response：

If you always want to get 3 groups back, make the last group capture optional non-whitespace characters, since that value is not always present in the example data.

Use a single capture group for all the variants at the start of the string, and put the optional parts in a non-capturing group.

Note that you should not put the intended groupings in square brackets: that turns them into a character class, which matches a single one of the characters listed in it.
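To make the character-class point concrete, here is a small sketch comparing the bracketed pattern from the question with a grouped one. The sample line is made up for illustration and is not taken from the input file. Inside [...] the parentheses and | are literal characters, so the bracketed pattern has only two real capture groups, which is consistent with the "expected 3, got 2" error.

```python
import re

# Pattern from the question: the alternatives are wrapped in [...],
# so the whole bracketed part is a character class matching ONE character.
bracketed = re.compile(
    r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$"
)

# Same alternatives in a single capture group, with an optional third value.
grouped = re.compile(
    r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)"
)

# .groups on a compiled pattern is the number of capture groups it defines.
print(bracketed.groups)
print(grouped.groups)

# Hypothetical sample line, only for this demonstration.
sample = "CREATE OR REPLACE SOURCE my_source"

# The class consumes just the "C", then the pattern requires a space.
print(bracketed.match(sample))
print(grouped.match(sample).groups())
```

Checking the .groups attribute like this is a quick way to confirm how many values match.groups() will hand back before unpacking them.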

The updated pattern:

^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)

See the capture groups in the regex demo

import re

# One capture group for the keyword, one for the next word,
# and an optional third word (empty string when absent).
matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")

with open('input.txt', 'r') as f:
    lines = f.readlines()
    # Skip blank / whitespace-only lines
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")

Output

component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
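
For a quick check without the input file, the same pattern can be run against a few in-memory lines. This is only a sketch: the sample lines and the parse_lines helper are hypothetical, made up to mirror the shape of the output above rather than copied from the real file.

```python
import re

matcher = re.compile(
    r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)"
)

# Hypothetical sample lines, not the actual contents of input.txt.
sample_lines = [
    "CREATE FLOW flow_a;",
    "CREATE OR REPLACE SOURCE comp_a USING reader (",
    "INPUT FROM stream_a;",
    "",
    "SELECT * FROM stream_a",
    "END FLOW flow_a;",
]


def parse_lines(lines):
    """Return a (component, component_type, component_name) tuple per matching line."""
    results = []
    for line in lines:
        # Skip empty and whitespace-only lines
        if not line or line.isspace():
            continue
        if match := matcher.match(line):
            # Always 3 values: the last one is '' when the third word is missing
            results.append(match.groups())
    return results


for component, component_type, component_name in parse_lines(sample_lines):
    print(f"{component=},{component_type=}, {component_name=}")
```

Lines that do not start with one of the keywords (the SELECT line here) are simply skipped, and a missing third word comes back as an empty string, so the three-way unpacking does not fail.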