I have a list of start_phrases and a list of stop_phrases.
I want to parse a file and write to an output file as follows: when a line contains ONLY one of the start_phrases, I want to start writing, beginning with that start phrase, and then keep appending the lines that follow to the output file.
When a line starts with one of the stop_phrases, I want to stop parsing and break out of the loop. The stop phrase itself should not be appended to the output.
start_phrases = ["Hello", "Come on:", "Introduction", "Background"]
stop_phrases = ["This is provided to assist", "The background knowledge is to know"]
I am reading the file as shown below:
with open(data, "r", encoding='utf-8') as myfile:
    for line in myfile:
        line = line.strip()  # strip() returns a new string, so keep the result
        print(line)
How can I add these conditions? Thanks.
CodePudding user response:
You can use regular expressions:
import re

start_phrases = ["Hello", "Come on:", "Introduction", "Background"]
stop_phrases = ["This is provided to assist", "The background knowledge is to know"]
# Escape each phrase so that its characters are matched literally.
start_regex = re.compile(rf'(?i)^\s*({"|".join(map(re.escape, start_phrases))})\s*$')
stop_regex = re.compile(rf'(?i)^\s*({"|".join(map(re.escape, stop_phrases))})\s*$')
parse = False
with open(data, "r", encoding='utf-8') as myfile:
    for line in myfile:
        if stop_regex.match(line):
            break  # stop line reached: it is not printed
        parse = parse or bool(start_regex.match(line))
        if parse:
            print(line, end='')  # the line already ends with a newline
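Create one regex that matches the start phrases and another that matches the stop phrases. Both are anchored with ^ and $, so they match only lines that consist of a phrase and nothing else (ignoring case and surrounding whitespace). The boolean parse keeps the state: if it is True, the current line is printed; otherwise it is skipped.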
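The question asks for the result to be written to an output file rather than printed. A minimal sketch of the same loop doing that is given here; the function name copy_section and the parameters in_path and out_path are hypothetical, chosen only for illustration, and the matching rules are assumed to be the same whole-line, case-insensitive ones described above.

```python
import re


def copy_section(in_path, out_path, start_phrases, stop_phrases):
    """Copy lines from the first start line up to, but not including, the first stop line."""
    # re.escape keeps every character of a phrase literal inside the pattern.
    start_regex = re.compile(
        r"(?i)^\s*(?:%s)\s*$" % "|".join(map(re.escape, start_phrases)))
    stop_regex = re.compile(
        r"(?i)^\s*(?:%s)\s*$" % "|".join(map(re.escape, stop_phrases)))
    parse = False
    with open(in_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if stop_regex.match(line):
                break  # the stop line itself is never written
            parse = parse or bool(start_regex.match(line))
            if parse:
                dst.write(line)  # line keeps its own trailing newline
```

A call might look like copy_section("input.txt", "output.txt", start_phrases, stop_phrases), where both file names are placeholders.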
Suppose that the content of the input file is:
aaaa
Hello WOrld
Hello
cccc
dddd
This is provided to assist gre
bbbb
This is provided to assist
kkkk
pppppp
The output is:
Hello
cccc
dddd
This is provided to assist gre
bbbb
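Note that "This is provided to assist gre" appears in the output because the stop regex only matches a line that contains nothing but a stop phrase. The question, however, asks to stop as soon as a line starts with a stop phrase. A possible variant for that reading, without regular expressions, is sketched below; the function name extract_section is hypothetical, and case-insensitive comparison is an assumption carried over from the (?i) flag.

```python
def extract_section(lines, start_phrases, stop_phrases):
    """Return the lines from the first start line up to the first
    line that starts with a stop phrase (that line is excluded)."""
    starts = {p.lower() for p in start_phrases}
    stops = tuple(p.lower() for p in stop_phrases)  # startswith accepts a tuple
    kept = []
    parse = False
    for line in lines:
        text = line.strip().lower()
        if text.startswith(stops):
            break  # a line beginning with a stop phrase ends the section
        if text in starts:
            parse = True  # the line is exactly one of the start phrases
        if parse:
            kept.append(line.rstrip("\n"))
    return kept
```

With the sample input above, this variant would keep only Hello, cccc and dddd, since the loop stops at "This is provided to assist gre".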