I am trying to remove a specific paragraph from a file using a regular expression in Python.
Here is the input.txt file:
some random texts (100 lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_a A_story(
...
some random texts adfsasdsd
...
)
paragraph_b different_story(
...
some random texts
...
)
The expected output is:
some random texts (100 lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_b different_story(
...
some random texts
...
)
What I want to do is delete the entire paragraph_a block (including the parentheses). It has to be located by the name of the paragraph directly below it (in this case, paragraph_b), because the contents of the to-be-deleted paragraph (in this case, paragraph_a) are random.
I've managed to write a regular expression that selects only the paragraph located directly above paragraph_b:
https://regex101.com/r/pwGVbe/1 <- you can refer to it here.
However, I couldn't delete the paragraph using this regular expression.
Here is what I've done so far:
import re
output = open('output.txt', 'w')
input = open('input.txt', 'r')
for line in input:
    # print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b', '', line)
    output.write(t)
Is there a solution or a clue for this? Any answer or advice would be appreciated.
Thanks.
CodePudding user response:
You can match the preceding paragraph by asserting that paragraph_b follows it, without crossing into any other paragraph.
Note that input is the name of a built-in function (not a reserved keyword), and assigning to it shadows that built-in. So instead of writing input = open('input.txt', 'r') you might write input_file = open('file', 'r')
^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
If the match also should not start with paragraph_b itself:
^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)
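To make the parts of that pattern easier to follow, here is a sketch of the same expression written with re.VERBOSE and a comment per piece. The name PARAGRAPH_BEFORE_B and the sample string are made up for illustration, and the sketch assumes every paragraph header has the shape "word word(" as in the question.

```python
import re

# Annotated form of the pattern. In re.VERBOSE mode plain spaces are ignored,
# so the literal space between the two words is written as [ ].
PARAGRAPH_BEFORE_B = re.compile(
    r"""
    ^(?!paragraph_b)        # start of a line; the block must not be paragraph_b itself
    \w+[ ]\w+\(             # a header such as: paragraph_a A_story(
    (?:                     # then any number of following lines ...
        \n(?!^\w+[ ]\w+\()  # ... as long as the next line is not another header
        .*
    )*
    \)                      # the closing parenthesis of the block
    (?=\s*^paragraph_b)     # what follows must be the paragraph_b header
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample shaped like the input in the question.
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "...\n"
    "some random texts adfsasdsd\n"
    "...\n"
    ")\n"
    "paragraph_b different_story(\n"
    "...\n"
    ")\n"
)

match = PARAGRAPH_BEFORE_B.search(sample)
matched_block = match.group(0) if match else None
```

The lookahead at the end only checks for paragraph_b and does not consume it, so the newline after the closing parenthesis is not part of the match.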
Example, using input_file.read() to read the whole file:
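import re

# Read the whole file at once so the pattern can span several lines.
with open('file', 'r') as input_file:
    text = input_file.read()

t = re.sub(
    r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)',
    '',
    text,
    flags=re.M
)

# Write the result; the with blocks close both files automatically.
with open('file_out', 'w') as output_file:
    output_file.write(t)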
Contents of the output file (file_out):
some random texts (100 lines)
bbb
...
ttt
some random texts
ccc
...
fff
paragraph_b different_story(
...
some random texts
...
)
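If the name of the paragraph below changes from file to file, the same idea can be wrapped in a small helper. This is only a sketch: remove_paragraph_before is a hypothetical name (it is not part of the re module), and it assumes headers always look like "word word(". Unlike the pattern above, it also consumes the newline after the closing parenthesis, so no empty line is left where the block used to be.

```python
import re


def remove_paragraph_before(text: str, next_name: str) -> str:
    """Delete the block that sits directly above the block named next_name."""
    name = re.escape(next_name)  # treat the name literally inside the regex
    pattern = re.compile(
        rf"^(?!{name})\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)\n(?=\s*^{name})",
        re.MULTILINE,
    )
    return pattern.sub("", text)


# Hypothetical sample shaped like the input in the question.
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "...\n"
    "some random texts adfsasdsd\n"
    "...\n"
    ")\n"
    "paragraph_b different_story(\n"
    "...\n"
    "some random texts\n"
    "...\n"
    ")\n"
)

result = remove_paragraph_before(sample, "paragraph_b")
```

Passing the name through re.escape keeps characters such as dots or brackets in a paragraph name from being read as regex syntax.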
CodePudding user response:
Your code doesn't work because you're processing the text line by line:
for line in input:
That way your regex only ever sees one line at a time, so it has no chance to match a block that spans several lines. You're better off reading the whole file at once into a single string variable and then applying the regex to that string.
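The difference can be seen on a small in-memory string. The sketch below uses a hypothetical sample and one possible multi-line pattern for "the block right above paragraph_b"; the point is only that the same pattern does nothing per line but matches when given the whole text.

```python
import re

# One possible multi-line pattern; it assumes headers look like "word word(".
PATTERN = r"^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)"

text = (
    "fff\n"
    "paragraph_a A_story(\n"
    "some random texts\n"
    ")\n"
    "paragraph_b different_story(\n"
    "some random texts\n"
    ")\n"
)

# Line by line: each call sees a single line, so the pattern never matches
# and every line is passed through unchanged.
per_line = "".join(
    re.sub(PATTERN, "", line, flags=re.M)
    for line in text.splitlines(keepends=True)
)

# Whole text at once: the pattern can span the block and see paragraph_b after it.
whole = re.sub(PATTERN, "", text, flags=re.M)
```

With a real file, text would come from a single read() call instead of a loop over the file object.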