Regex to match a pattern and then delete it from a text file

I am trying to write a regex that matches a pattern in a text file and, once the pattern is found, removes it from the text.

# Read the file contents and store them

with open('file.txt', 'r') as f:
    file = f.read()
print(file)

Here is my text, shown as a Python string literal:

'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDAT_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tbunch of characters \t985\n\t\t2\tbunch of data\t\t78\n\t\t}\n\t}\n\tINFO\tDATA_CATCHME\t123\n\t{\n\t\t3\tbunch of characters \n\t\t2\tbunch of datas\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n'

I would like to search for DATA_TO_MATCH, find the last closing bracket "}" before it, and remove everything from just after that bracket up to and including the next closing bracket. I would like to do the same for DATA_CATCHME.

Here is the expected result:

'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDATA_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n}\n'

I tried the following:

import re
#find the DATA_TO_MATCH
re.findall(r".*DATA_TO_MATCH",file)  

#find the DATA_CATCHME
re.findall(r".*DATA_CATCHME",file)  

#supposed to find everything before the closed bracket "}"  
re.findall(r"(?=.*})[^}]*",file)

But I am not very familiar with regex and the re module, and I can't get what I want from these attempts. Once the pattern is found, I guess I will use

re.sub(my_pattern, '', text)

to remove it from my text file.

CodePudding user response：

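The main trick here is to let the pattern run across line breaks. In Python, \s and negated character classes such as [^}] match newlines, while . does not unless re.DOTALL is set. The re.MULTILINE flag only changes where ^ and $ match, so it is harmless here but is not what makes the match cross lines. You should also use re.sub directly rather than re.findall.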

The regex itself is simple once you break it down. .* matches the start of the line up to DATA_TO_MATCH, \s* chews up any whitespace that follows (the * allows for none at all), { matches the opening brace, [^}]* reads all characters that aren't a }, and the final } consumes the closing brace. The second pattern uses the same tactic, with [^{]* in place of \s* so that it also skips the rest of the INFO line.
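As a sketch of that step-by-step reading, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The sample string is a shortened, made-up stand-in for the question's data, and the braces are escaped here only for readability.

```python
import re

# Hypothetical sample, shaped like the data in the question.
sample = "\t\tKEEP\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tabc\t985\n\t\t}\n\t}\n"

block = re.compile(
    r"""
    .*DATA_TO_MATCH   # start of the line through the keyword; '.' stops at newlines
    \s*               # any whitespace before the brace, newlines included
    \{                # the opening brace
    [^}]*             # everything that is not a closing brace, newlines included
    \}                # the closing brace
    """,
    re.VERBOSE,
)

cleaned = block.sub("", sample)
print(repr(cleaned))
```

The newline that followed the closing brace is not part of the match, so an empty line is left where the block used to be.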

import re

with open('input.txt', 'r') as f:
    file = f.read()

# remove the DATA_TO_MATCH line and the {...} block that follows it
file = re.sub(r".*DATA_TO_MATCH\s*{[^}]*}", "", file, flags=re.MULTILINE)

# remove the DATA_CATCHME line and the {...} block that follows it
file = re.sub(r".*DATA_CATCHME[^{]*{[^}]*}", "", file, flags=re.MULTILINE)
print(file)
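If more keywords need the same treatment, the two substitutions could be folded into one small helper. This is only a sketch: remove_block is a hypothetical name, the sample text is a shortened version of the question's data, and it assumes the block being removed contains no nested braces, because [^}]* stops at the first closing brace.

```python
import re


def remove_block(text: str, name: str) -> str:
    """Remove the line containing `name` and the first {...} block after it.

    Assumes the block has no nested braces (true for the question's sample).
    """
    # re.escape guards against regex metacharacters in the keyword.
    pattern = r".*" + re.escape(name) + r"[^{}]*\{[^}]*\}"
    return re.sub(pattern, "", text)


# Shortened, made-up version of the question's data.
text = (
    "{\n"
    "\tINFO\tDATA_NUMBER\t974\n"
    "\t{\n"
    "\t\tDATA_TO_MATCH\n"
    "\t\t{\n"
    "\t\t1\tbunch of characters\t985\n"
    "\t\t}\n"
    "\t}\n"
    "\tINFO\tDATA_CATCHME\t123\n"
    "\t{\n"
    "\t\t3\tbunch of characters\n"
    "\t}\n"
    "\tINFO\tDATA_INTACT\t456\n"
    "\t{\n"
    "\t\t2\tbunch of texts\n"
    "\t}\n"
    "}\n"
)

result = text
for name in ("DATA_TO_MATCH", "DATA_CATCHME"):
    result = remove_block(result, name)

print(result)
```

To update the file itself, the result can be written back by opening the file in 'w' mode and calling f.write(result). Each removal leaves the newline that followed the closing brace, so a blank line remains where each block was.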