I have an input text file with multiple lines. I want to split each line and start a new line wherever one of a list of optional keywords is detected in it. To do so, I composed a pattern for the keywords that might appear in each line; when one does, I want to split the line at that point. My current attempt does not split the line when a keyword is detected. Is there a better way to do this with regex in Python? Any thoughts?
use case
Here is part of the input text. I want to check its format and split a line whenever one of the keywords appears in the middle or at the end of it.
input txt
CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;
CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader (
DicMod: 'on',
url: '$src_url',
trans: true,
toSend: true ) OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter (
stdSQL: 'true',
hasID: 'true',
delim: '\"' )
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;
CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime()) FROM stream_src_vod2_sar_app_wak1_tst_xor s;
CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
replaceFrom: '[\\u0000]',
includeBefore: true )
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;
END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;
END APPLICATION vod2_sar_app_wak1_tst_xor;
my current attempt:
line_patt = re.compile(r"(INPUT(?: FROM)?|(INSERT(?: INTO))|(OUPUT(?: TO)))")
with open('input.txt', 'r') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := line_patt.match(line):
            line.split('\n')
        else:
            continue
But this does not give me the desired format. How can this be done correctly with a regex?
desired format for input
this is how I want the input to be formatted:
CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;
CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader (
DicMod: 'on',
url: '$src_url',
trans: true,
toSend: true )
OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter (
stdSQL: 'true',
hasID: 'true',
delim: '\"' )
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;
CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime())
FROM stream_src_vod2_sar_app_wak1_tst_xor s;
CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
replaceFrom: '[\\u0000]',
includeBefore: true )
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream
FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;
END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;
END APPLICATION vod2_sar_app_wak1_tst_xor;
CodePudding user response:
In your code, you are currently not doing anything with the result of line.split('\n'), and the split does not take any of the matches into account.
Also note that there is a typo in the pattern: OUPUT should be OUTPUT.
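As a minimal sketch of why the original loop has no visible effect (the stream name is shortened for illustration and the variable names are assumptions, not part of the question): str.split returns a new list instead of changing the string, and Pattern.match only tries to match at the start of the string, whereas Pattern.search scans the whole line.

```python
import re

# Sample line modelled on the input above (stream name shortened for illustration).
line = "toSend: true ) OUTPUT TO stream_src;\n"
keyword = re.compile(r"OUTPUT TO")

# str.split returns a new list and leaves `line` untouched, so calling it
# without assigning the result has no effect.
parts = line.split("\n")

# Pattern.match only tries to match at the very start of the string.
at_start = keyword.match(line)

# Pattern.search scans the whole string and finds the keyword mid-line.
anywhere = keyword.search(line)

print(parts)
print(at_start)
print(line[anywhere.start():])
```

So even with the typo fixed, a keyword in the middle of a line would never be found by match, and the split result would still be thrown away.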
In this part, for example, (INSERT(?: INTO)), the non-capturing group has no purpose, so you can write it simply as (INSERT INTO).
But looking at the example data, INPUT is optional and FROM occurs two times, so that would be (?:INPUT )?FROM
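A small check of that optional group, as a sketch using two lines from the example input (the names optional_input, samples and matches are only illustrative):

```python
import re

# "INPUT " is optional, so the pattern covers both "INPUT FROM" and a bare "FROM".
optional_input = re.compile(r"(?:INPUT )?FROM")

samples = [
    "INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;",
    "INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;",
]

# Collect the text matched in each sample line.
matches = [optional_input.search(s).group(0) for s in samples]
print(matches)
```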
One option is to read the whole file at once and use re.sub to replace each match with a newline followed by the matched text.
As you don't want to prepend a newline to an alternative that is already at the start of a line, you can assert that the start of the line is not directly to the left of the match, and only match a bare FROM when it is not preceded by INPUT at the start of the line.
In the replacement, use a newline followed by the full match: \n\g<0>
\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b
import re

line_patt = re.compile(r"\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b", re.M)

with open('input.txt', 'r') as f:
    # Prepend a newline to every match; \g<0> is the full matched text
    lines = line_patt.sub(r"\n\g<0>", f.read())
print(lines)
Output
CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;
CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader (
DicMod: 'on',
url: '$src_url',
trans: true,
toSend: true )
OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;
CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter (
stdSQL: 'true',
hasID: 'true',
delim: '\"' )
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;
CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor
INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime())
FROM stream_src_vod2_sar_app_wak1_tst_xor s;
CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
replaceFrom: '[\\u0000]',
includeBefore: true )
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream
FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;
END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;
END APPLICATION vod2_sar_app_wak1_tst_xor;
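To try the same substitution without a file, here is a sketch on a short in-memory string. The stream names stream_a to stream_d are made up for illustration; the pattern is the one from this answer.

```python
import re

line_patt = re.compile(
    r"\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b",
    re.M,
)

# Hypothetical three-line sample covering the cases discussed above.
sample = (
    "toSend: true ) OUTPUT TO stream_a;\n"     # keyword mid-line: gets split
    "INPUT FROM stream_b;\n"                   # keyword at line start: untouched
    "INSERT INTO stream_c FROM stream_d;\n"    # only the mid-line FROM is split
)

# Insert a newline in front of every match; \g<0> is the whole match.
result = line_patt.sub(r"\n\g<0>", sample)
print(result)
```

Note that the space in front of a keyword stays at the end of the preceding line; if that matters, the lines can be stripped afterwards, for example with str.rstrip.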
CodePudding user response:
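You can do something like this after reading the entire text into a variable text, assuming that the desired keywords are in the list key_words:
key_words = ['FROM', 'INPUT', 'OUTPUT']
for k in key_words:
    # Split on the keyword, then put it back at the start of each later piece
    b = text.split(k)
    for i in range(1, len(b)):
        b[i] = k + b[i]
    # Rejoin so that every occurrence of the keyword starts a new line
    text = '\n'.join(b)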
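The loop above can be wrapped in a function and tried on a short string. This is only a sketch: the function name break_before_keywords and the sample text are made up for illustration.

```python
def break_before_keywords(text, key_words):
    """Insert a newline before every occurrence of each keyword."""
    for k in key_words:
        parts = text.split(k)
        # Re-attach the keyword that split() removed to each later piece.
        parts[1:] = [k + p for p in parts[1:]]
        text = "\n".join(parts)
    return text


# Hypothetical sample; s1, s2 and s3 stand in for real stream names.
sample = "a ) OUTPUT TO s1;\nb INSERT INTO s2 FROM s3;\n"
result = break_before_keywords(sample, ["FROM", "OUTPUT"])
print(result)
```

Because str.split works on plain substrings, this approach also breaks before a keyword that is already at the start of a line (leaving an empty line) and before a keyword that is part of a longer word, and a phrase such as INPUT FROM is split twice when both words are in the list. The regex approach avoids these cases with \b and the lookbehind assertions.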