Start and End of Text file parsing with some conditions

Time:12-04

I have a list of start phrases (start_phrases) and a list of stop phrases (stop_phrases).

I want to parse a file and write to an output file as follows: when a line contains ONLY one of the start phrases, I want to start writing, beginning with that line, and then keep appending the lines that follow it to the output file.

When a line starts with one of the stop phrases, I want to stop parsing and break out of the loop. The stop line itself should not be appended to the output.

start_phrases = ["Hello", "Come on:", "Introduction", "Background"]
stop_phrases = ["This is provided to assist", "The background knowledge is to know"]

I am reading a file as below.

with open(data, "r", encoding='utf-8') as myfile:
    for line in myfile:
        line = line.strip()  # strip() returns a new string, so keep the result
        print(line)

How can I include these conditions? Thanks.

CodePudding user response:

You can use regular expressions:

import re

start_phrases = ["Hello", "Come on:", "Introduction", "Background"]
stop_phrases = ["This is provided to assist", "The background knowledge is to know"]

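# re.escape makes any regex metacharacters in the phrases match literally;
# the raw f-string keeps \s from being treated as a string escape.
start_regex = re.compile(rf'(?i)^\s*({"|".join(map(re.escape, start_phrases))})\s*$')
stop_regex = re.compile(rf'(?i)^\s*({"|".join(map(re.escape, stop_phrases))})\s*$')

parse = False  # becomes truthy once a start line has been seen
with open(data, "r", encoding='utf-8') as myfile:  # data: path to the input file
    for line in myfile:
        if stop_regex.match(line):
            break  # stop line reached: it is not printed

        parse = parse or start_regex.match(line)
        if parse:
            print(line)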

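One regex matches the start lines and another matches the stop lines. Both patterns are case-insensitive and end in \s*$, so a line matches only if it consists of the phrase alone (apart from surrounding whitespace).

The variable parse keeps the state: while it is truthy, the current line is printed; otherwise the line is skipped.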
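Separately from the loop above, here is a small sketch of how the anchoring affects which lines count as stop lines. The names `whole_line`, `prefix` and `classify` are made up for this illustration; the question asks for lines that start with a stop phrase, which corresponds to the `prefix` pattern rather than the whole-line one.

```python
import re

stop_phrases = ["This is provided to assist", "The background knowledge is to know"]
alternatives = "|".join(re.escape(p) for p in stop_phrases)

# Whole-line match: the phrase must be the only thing on the line.
whole_line = re.compile(rf"^\s*(?:{alternatives})\s*$", re.IGNORECASE)
# Prefix match: the line only has to begin with the phrase.
prefix = re.compile(rf"^\s*(?:{alternatives})", re.IGNORECASE)


def classify(line):
    """Return (matches_whole_line, matches_prefix) for a single line."""
    return bool(whole_line.match(line)), bool(prefix.match(line))
```

With these patterns, a line such as "This is provided to assist gre" is a stop line only under the prefix pattern, so dropping the trailing \s*$ from the stop regex is one way to get the "starts with" behaviour.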

Suppose that the content of the input file is:

aaaa
Hello WOrld
Hello
cccc
dddd
This is provided to assist gre
bbbb
This is provided to assist
kkkk
pppppp

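The output is shown below. The blank lines appear because each line keeps its own trailing newline and print adds another one. "This is provided to assist gre" is printed because it is not an exact match for a stop phrase: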

Hello

cccc

dddd

This is provided to assist gre

bbbb
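
If the goal is the exact behaviour described in the question (start on a line that contains only a start phrase, stop on a line that starts with a stop phrase, and write to a file instead of printing), a version without regular expressions might look like the sketch below. The function names `extract_section` and `copy_section` are hypothetical, and the case-insensitive comparison is an assumption carried over from the (?i) flag in the answer.

```python
def extract_section(lines, start_phrases, stop_phrases):
    """Yield lines from the first start line up to, but not including, the first stop line."""
    starts = {p.lower() for p in start_phrases}
    stops = tuple(p.lower() for p in stop_phrases)
    copying = False
    for line in lines:
        text = line.strip().lower()
        if text.startswith(stops):
            break  # stop line: never copied, and nothing after it is read
        if not copying and text in starts:
            copying = True  # the start line itself is copied too
        if copying:
            yield line


def copy_section(in_path, out_path, start_phrases, stop_phrases):
    """Write the extracted section of in_path to out_path (overwriting out_path)."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(extract_section(src, start_phrases, stop_phrases))
```

On the sample input above this keeps only Hello, cccc and dddd, because "This is provided to assist gre" already counts as a stop line. Opening the output with mode "a" instead of "w" would append to an existing file rather than replace it.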