Is there a better way to validate a string pattern after a keyword is found in a text file?

I am trying to validate strings in an input file, and I have been using re to build patterns and parse the file. The input is a CSV file containing multiple lines of SQL-like script. I want to validate the strings in each line in order: first check keyword1, then keyword2, then keyword3. I did this with nested for loops, but I feel there must be a better way to handle it. Can anyone suggest how to tackle this?

use case

CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text
CREATE flow flow_src_vat4_xyz_rcc_clm1_trm_soc some text
CREATE stream main_stream_vat4_xyz_rcc_clm1_trm_soc with some text
CREATE OR REPLACE target comp_tgt_vat4_xyz_rcc_clm1_trm_soc some text

To handle this, I tried the following:

kword = ['CREATE OR REPLACE', 'CREATE']  # longer keyword first so it is not shadowed by 'CREATE'
with open('myinput.txt', 'r') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        for word in kword:
            if line.startswith(word + ' '):
                # remaining tokens after the keyword: type first, then name
                tokens = iter(line[len(word):].split())
                atype = next(tokens, None)
                print('type is', atype)  # such as application, flow, stream

                if atype in ['application', 'flow', 'stream']:
                    value = next(tokens, None)  # such as vat4_xyz_rcc_clm1_trm_soc, flow_src_vat4_xyz_rcc_clm1_trm_soc
                    print("name", value)
                else:
                    print("name not found")
                break
        else:
            # no keyword matched at the start of this line
            print("type is not correct")

But this is not efficient code. I think re might do a better job here. Does anyone have a better idea?

objective:

Basically, I need to parse each line. If I spot keyword1, such as CREATE, I check the word next to it. If that next word is application, I print it and then check the word after it, for which I composed a pattern as follows:

vat4_xyz_rcc_clm1_trm_soc
pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
m=re.match(pat1, curStr, re.M)

Each line has a different pattern, such as:

pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat2=r'^\flow_src_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat3=r'^\main_stream_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat4=r'^\comp_tgt_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'

How can we make this as simple as possible for parsing each line with re? Any thoughts?

User response:

The regex seems like it can be way simpler than what you're trying. How about this:

import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")

with open("test.txt", "r") as f:
    for line in f:
        if match := matcher.match(line):
            action, action_type, value = match.groups()
            print(f"{action=}, {action_type=}, {value=}")

Outputs:

action='CREATE', action_type='application', value='vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='flow', value='flow_src_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='stream', value='main_stream_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE OR REPLACE', action_type='target', value='comp_tgt_vat4_xyz_rcc_clm1_trm_soc'
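
To try the line pattern without a file, here is a small sketch that runs it over the four sample lines from the question. The named groups are an optional variation, and `LINE_RE`, `SAMPLE`, and `parse_line` are hypothetical names used only for illustration.

```python
import re

# Line pattern with named groups: action, object type, object name.
LINE_RE = re.compile(
    r"^(?P<action>CREATE(?: OR REPLACE)?) (?P<type>\S+) (?P<name>\S+)"
)

# The sample lines from the question, kept in memory instead of a file.
SAMPLE = """\
CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text
CREATE flow flow_src_vat4_xyz_rcc_clm1_trm_soc some text
CREATE stream main_stream_vat4_xyz_rcc_clm1_trm_soc with some text
CREATE OR REPLACE target comp_tgt_vat4_xyz_rcc_clm1_trm_soc some text
"""


def parse_line(line):
    """Return (action, type, name) for a CREATE line, or None if it does not match."""
    m = LINE_RE.match(line)
    return (m["action"], m["type"], m["name"]) if m else None


if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        print(parse_line(line))
```

Returning None for lines that do not match lets the caller decide whether to skip or report them.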

If you want to further validate the values, I would take the results from the first regex and pipe them into more specialized regexes, one for each action type.

import re

line_matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")
value_matchers = {
    "application": re.compile(r'^vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "flow": re.compile(r'^flow_src_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "stream": re.compile(r'^main_stream_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "target": re.compile(r'^comp_tgt_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
}

with open("test.txt", "r") as file:
    for line in file:
        if not (line_match := line_matcher.match(line)):
            print(f"Invalid line: {line=}")
            continue

        action, action_type, value = line_match.groups()
        if not (value_matcher := value_matchers.get(action_type)):
            print(f"Invalid action type: {line=}")
            continue

        if not value_matcher.match(value):
            print(f"Invalid {action_type} value: {line=}")
            continue

        # Do some work on the items
        ...
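
The four value patterns differ only in their prefix, so one option is to build them from a shared core. The following is a minimal sketch under that assumption: `CORE`, `PREFIXES`, `VALUE_RES`, and `validate` are hypothetical names, and the prefixes are taken from the sample lines in the question. It uses `fullmatch`, which also rejects trailing characters after `soc`/`aws`, so it is stricter than a plain `match`.

```python
import re

LINE_RE = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+)")

# Shared part of every name, e.g. vat4_xyz_rcc_clm1_trm_soc
CORE = r"vat\d+_(?:xyz|abs)_rcc_clm\d+_trm_(?:soc|aws)"

# Prefix expected for each object type (assumed from the sample lines).
PREFIXES = {
    "application": "",
    "flow": "flow_src_",
    "stream": "main_stream_",
    "target": "comp_tgt_",
}

# One compiled pattern per type: escaped literal prefix + shared core.
VALUE_RES = {
    kind: re.compile(re.escape(prefix) + CORE)
    for kind, prefix in PREFIXES.items()
}


def validate(line):
    """Return an error message for an invalid line, or None if the line is valid."""
    m = LINE_RE.match(line)
    if not m:
        return "invalid line"
    _action, kind, name = m.groups()
    value_re = VALUE_RES.get(kind)
    if value_re is None:
        return f"invalid type: {kind}"
    if not value_re.fullmatch(name):
        return f"invalid {kind} name: {name}"
    return None
```

With this layout, supporting a new object type means adding one entry to `PREFIXES` rather than copying the whole pattern.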