Replace a specific pattern using a regular expression in Python

I am trying to remove a specific paragraph using a regular expression in Python.

Here is the input.txt file:

some random texts (100  lines)
bbb
...
ttt
some random texts
ccc
...
fff    
paragraph_a A_story(

...
some random texts adfsasdsd

...
)

paragraph_b different_story(
...
some random texts
...
)

The expected output is:

some random texts (100  lines)
bbb
...
ttt
some random texts
ccc
...
fff    

paragraph_b different_story(
...
some random texts
...
)

What I want to do is delete the entire paragraph_a block (including the parentheses). However, it has to be identified by the name of the paragraph directly below it (in this case, paragraph_b), because the contents of the paragraph to be deleted (in this case, paragraph_a) are random.

I've managed to write a regular expression that selects only the paragraph located right above paragraph_b:

https://regex101.com/r/pwGVbe/1

However, I couldn't delete what I want using this regular expression.

Here is what I've done so far:

import re

output = open ('output.txt', 'w')
input = open('input.txt', 'r')

for line in input:
#    print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)

Is there a solution or a clue I could follow? Any answer or advice would be appreciated.

Thanks.

CodePudding user response：

You can match the paragraph directly above paragraph_b by asserting that paragraph_b follows it, without crossing into any other paragraph.

Note that input is the name of a built-in function (not a reserved keyword), so assigning to it shadows the built-in. Instead of writing input = open('input.txt', 'r') you might write input_file = open('input.txt', 'r').

 ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

If the match also should not start with paragraph_b itself:

^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Example, using input_file.read() to read the whole file:

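import re

# Match the paragraph directly above paragraph_b (but not paragraph_b itself).
pattern = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

# Read the whole file at once so the pattern can span several lines.
with open('input.txt', 'r') as input_file:
    text = input_file.read()

# re.M makes ^ match at the start of every line, not only at the start of the text.
t = re.sub(pattern, '', text, flags=re.M)

with open('output.txt', 'w') as output_file:
    output_file.write(t)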

Contents of output.txt

some random texts (100  lines)
bbb
...
ttt
some random texts
ccc
...
fff    


paragraph_b different_story(
...
some random texts
...
)
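
To make the pattern above easier to follow, here is a sketch of the same regex written with re.VERBOSE, so each part can carry a comment. The sample string is a hypothetical, shortened stand-in for input.txt and is not taken from the real file. In verbose mode plain spaces in the pattern are ignored, so the literal space is written as [ ].

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
# Plain spaces are ignored in verbose mode, hence "[ ]" for the literal space.
PATTERN = re.compile(
    r"""
    ^(?!paragraph_b)          # start of a line that is not paragraph_b itself
    \w+[ ]\w+\(               # a header line such as: paragraph_a A_story(
    (?:                       # then any number of following lines ...
        \n(?!^\w+[ ]\w+\()    # ... that do not start a new paragraph header
        .*
    )*
    \)                        # the closing parenthesis of the paragraph
    (?=\s*^paragraph_b)       # only if paragraph_b starts right after it
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample text, shortened from the question's input.txt.
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "\n"
    "some random texts adfsasdsd\n"
    ")\n"
    "\n"
    "paragraph_b different_story(\n"
    "some random texts\n"
    ")\n"
)

result = PATTERN.sub("", sample)
print(result)
```

Only the matched paragraph is removed; the line breaks around it stay, which is why the output above shows two empty lines before paragraph_b.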

CodePudding user response：

Your code doesn't work because you're parsing text line by line:

for line in input:

That way your regex never sees more than one line at a time, so it has no chance of matching a paragraph that spans several lines. You're better off reading the whole file at once, storing it in a single string variable, and then applying your regex to that string.
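
The difference can be seen in a small sketch that runs the same pattern once per line and once over the whole text. The text variable is a hypothetical stand-in for the contents of input.txt, and the pattern is the one from the first answer.

```python
import re

# Pattern from the first answer; it needs to see several lines at once.
PATTERN = re.compile(
    r"^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)",
    re.MULTILINE,
)

# Hypothetical stand-in for the contents of input.txt.
text = (
    "fff\n"
    "paragraph_a A_story(\n"
    "some random texts\n"
    ")\n"
    "\n"
    "paragraph_b different_story(\n"
    "some random texts\n"
    ")\n"
)

# Line by line: each call sees a single line, so the closing parenthesis and
# the paragraph_b header on later lines are never visible and nothing matches.
per_line = "".join(
    PATTERN.sub("", line) for line in text.splitlines(keepends=True)
)

# Whole text at once: the pattern can span the paragraph and see paragraph_b.
whole = PATTERN.sub("", text)

print(per_line == text)        # True: the line-by-line pass changed nothing
print("paragraph_a" in whole)  # False: the whole-text pass removed the paragraph
```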