I am trying to write a regex that matches a pattern in a text file and then removes the matched text from the file contents.
# Read the file contents into a string
with open('file.txt', 'r') as f:
    file = f.read()
print(file)
Here is my text when printed:
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDAT_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tbunch of characters \t985\n\t\t2\tbunch of data\t\t78\n\t\t}\n\t}\n\tINFO\tDATA_CATCHME\t123\n\t{\n\t\t3\tbunch of characters \n\t\t2\tbunch of datas\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n'
I would like to search for DATA_TO_MATCH, look back to the closing bracket "}" that precedes it, and remove everything after that bracket up to and including the next closing bracket. I would like to do the same for DATA_CATCHME.
Here is the expected result:
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDATA_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n}\n'
I tried the following:
import re
#find the DATA_TO_MATCH
re.findall(r".*DATA_TO_MATCH",file)
#find the DATA_CATCHME
re.findall(r".*DATA_CATCHME",file)
#supposed to find everything before the closed bracket "}"
re.findall(r"(?=.*})[^}]*",file)
But I am not very familiar with regex and the re module, and I can't get what I want from it. Once the pattern is found, I guess I will use
re.sub(my_patern,'', text)
to remove it from my text file.
CodePudding user response:
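The main trick here is to use re.sub directly rather than re.findall, with a pattern whose character classes can cross line boundaries. Note that the re.MULTILINE flag only changes where ^ and $ match; it does not by itself make a pattern span lines. The patterns below cross lines because \s and negated classes such as [^}] also match newlines, so the flag is harmless here but not required.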
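The regex itself is simple once you understand it. The .* matches whatever precedes DATA_TO_MATCH on the same line (here, the leading tabs), then \s* chews up any whitespace that may exist before the brace (hence the *), { reads the opening brace, [^}]* reads all characters that aren't a }, and the final } consumes the closing brace. The second pattern uses a very similar tactic, except that [^{]* skips the rest of the INFO line before the opening brace.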
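To see the pieces of the pattern at work without a file, here is a minimal sketch on a made-up inline sample (the `sample` string is an assumption that only mimics the layout in the question). The braces are escaped for readability, which should not change what the pattern matches. It also runs the substitution with and without the flag for comparison, since in this sketch the character classes appear to be what carries the match across lines.

```python
import re

# Made-up sample that mimics the question's layout (tabs, braces, newlines).
sample = "\t{\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tbunch of characters\n\t\t}\n\t}\n"

# .*         leading characters on the marker's own line ('.' stops at a newline)
# \s*        whitespace, including the newline, between the marker and '{'
# \{[^}]*\}  the brace block; the negated class [^}] also matches newlines
pattern = r".*DATA_TO_MATCH\s*\{[^}]*\}"

with_flag = re.sub(pattern, "", sample, flags=re.MULTILINE)
without_flag = re.sub(pattern, "", sample)

print(repr(with_flag))
print(repr(without_flag))
```

Both calls should leave only the outer braces, with an empty line where the marker and its block used to be.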
import re
with open('input.txt', 'r') as f:
    file = f.read()
# remove the DATA_TO_MATCH line and the {...} block that follows it
file = re.sub(r".*DATA_TO_MATCH\s*{[^}]*}", "", file, flags=re.MULTILINE)
# remove the DATA_CATCHME line and the {...} block that follows it
file = re.sub(r".*DATA_CATCHME[^{]*{[^}]*}", "", file, flags=re.MULTILINE)
print(file)
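If more markers need the same treatment, the two substitutions can be folded into one helper. The following is only a sketch: `remove_blocks` is a hypothetical name, the sample string is made up, and it assumes the block after each marker contains no nested braces (a `[^}]*` class stops at the first closing brace it meets).

```python
import re

def remove_blocks(text, markers):
    """Remove each marker line plus the first {...} block that follows it.

    Hypothetical helper; assumes the removed block has no nested braces.
    """
    for marker in markers:
        # re.escape guards against regex metacharacters in the marker name.
        pattern = r".*" + re.escape(marker) + r"[^{]*\{[^}]*\}"
        text = re.sub(pattern, "", text)
    return text

# Made-up sample with one block to remove and one to keep.
sample = (
    "\tINFO\tDATA_CATCHME\t123\n\t{\n\t\t3\tbunch of characters\n\t}\n"
    "\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t2\tbunch of texts\n\t}\n"
)
cleaned = remove_blocks(sample, ["DATA_TO_MATCH", "DATA_CATCHME"])
print(repr(cleaned))
```

A marker that does not occur in the text is simply skipped, and the cleaned string can then be written back by opening the file in 'w' mode.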