regex: cleaning text: remove everything up to a certain line

Time:12-24

I have a text file containing The Tragedie of Macbeth. I want to clean it, and the first step is to remove everything up to the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.

I tried:

import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
    with open('removed_intro_file', 'w') as output:
        removed = re.sub(title, '', removed_intro)
        print(removed)
        output.write(removed)

The print statement doesn't print anything, so it doesn't seem to match anything. How can I use a regex across several lines? Should I instead use pointers to the start and end of the lines to be removed? I'd also be glad to know if there is a nicer way to solve this, maybe without regex.

CodePudding user response:

Your regex only replaces the title with ''. You want to remove the title and all the text before it, so match every character (including newlines) from the beginning of the string through the title. This should work (I only tested it on a sample file I wrote):

removed = re.sub(r'(?s)^.*' + re.escape(title), '', removed_intro)
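
To see what the (?s) flag changes, here is a small check on a made-up string. The sample text is an assumption standing in for the real file, and the lazy `.*?` with `count=1` is a variation on the line above, not part of it: it stops at the first occurrence of the title, whereas the greedy `.*` runs to the last one if the title appears more than once.

```python
import re

title = 'The Tragedie of Macbeth'
# Hypothetical sample standing in for the contents of MacBeth.txt
sample = (
    "Project header\n"
    "License notes\n"
    "The Tragedie of Macbeth\n"
    "Actus Primus.\n"
    "Scoena Prima.\n"
)

# Without (?s), '.' does not match newlines, so '^.*' + title can only match
# when the title sits on the very first line; here nothing is removed.
unchanged = re.sub(r'^.*' + re.escape(title), '', sample)

# With (?s) (same as re.DOTALL), '.' also matches newlines, so the pattern
# can span the intro lines. The lazy '.*?' stops at the first title.
removed = re.sub(r'(?s)^.*?' + re.escape(title), '', sample, count=1)

print(repr(removed))
```

Note that the title itself is removed too, so the result starts with the newline that followed it.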

CodePudding user response:

We can try reading your file line by line until hitting the target line. After that, read all subsequent lines into the output file.

filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    line = file.readline()
    # discard all lines before the Macbeth title; readline() keeps the trailing
    # newline (hence strip) and returns '' at end of file, which ends the loop
    while line and line.strip() != title:
        line = file.readline()
    lines = file.read()                  # read the rest of the file, newlines intact
    with open('removed_intro_file', 'w') as output:
        output.write(title + "\n" + lines)

This approach avoids regex altogether and may well be faster, since it only does plain string comparisons line by line until it reaches the title.
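
A streaming variant of the same idea, as a minimal sketch: it never holds the whole file in memory and copies lines straight to the output. The helper names drop_intro and remove_intro are hypothetical, and it assumes the title stands alone on its line (apart from surrounding whitespace).

```python
from itertools import dropwhile


def drop_intro(lines, title):
    """Skip lines until one equals `title` (ignoring surrounding whitespace).

    Returns an iterator that starts at the title line; it is empty if the
    title never appears.
    """
    return dropwhile(lambda line: line.strip() != title, lines)


def remove_intro(src_path, dst_path, title):
    """Copy src_path to dst_path, starting at the title line."""
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        dst.writelines(drop_intro(src, title))
```

With the names from the question, this would be called as remove_intro('MacBeth.txt', 'removed_intro_file', 'The Tragedie of Macbeth').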
