I've got a txt file that essentially reads like this:
line
line
line
<tag>
info
info
info
</tag>
<tag>
info
info
info
</tag>
line
line
I want to edit the file so that it keeps only the info lines (including the tags, which are the same in both instances) and drops the other lines. After this I'll export it as XML and import it into Excel as a table.
I've tried two variations so far, with no luck:
1
    import re
    with open('document.txt') as test:
        for line in test:
            target = "<tag>(.*?)</tag>"
            res = re.findall(target, str(test))
        test.write(str(res))
This seems to just return an empty list and prints [] at the end of my document.
2
    with open('document.txt') as test:
        parsing = False
        for line in test:
            with open('document.txt') as test:
                if line.startswith("<tag>"):
                    parsing = True
                elif line.startswith("</tag>"):
                    parsing = False
                if parsing==True:
                    test.write(line)
This just messes up my document and places various text/tags in weird places
e.g. I started with
i
<tag>j</tag>
k
<tag>l</tag>
m
as a test, and ended up with
mtag>l</tag>
>
k
<tag>l</tag>
m
I'm pretty new to Python (if you couldn't tell) so apologies if there's a pretty easy fix to this.
Thanks in advance.
CodePudding user response:
You could do it like this:
    # Read the whole file into a list of lines
    with open('document.txt', 'r') as file:
        lines = file.readlines()

    output = []
    inside_tag = False  # True while we are between <tag> and </tag>
    for line in lines:
        if line.strip() == '<tag>':
            inside_tag = True
            output.append(line)
            continue
        elif line.strip() == '</tag>':
            inside_tag = False
            output.append(line)
            continue
        elif inside_tag:
            output.append(line)

    # Write only the kept lines to a new file
    with open('output.xml', 'w') as file:
        file.writelines(output)
The output.xml will contain the following:
<tag>
info
info
info
</tag>
<tag>
info
info
info
</tag>
If you want to remove the tabs before the info, you can simply use output.append(line.strip() + '\n') instead of output.append(line).
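One thing to watch if the file is going into Excel: an XML document needs a single root element, and the output above has several top-level <tag> elements. Below is a minimal sketch of the same line-by-line approach that wraps the blocks in one root element. The function name extract_blocks, the root name "root", and the choice to strip surrounding whitespace from every kept line are assumptions for illustration, not part of the answer above.

```python
def extract_blocks(lines, tag="tag", root="root"):
    """Keep only the lines between <tag> and </tag> (tags included),
    wrapped in a single root element. All names are illustrative."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    output = [f"<{root}>\n"]
    inside = False  # True while we are between the opening and closing tag
    for line in lines:
        stripped = line.strip()
        if stripped == open_tag:
            inside = True
            output.append(stripped + "\n")
        elif stripped == close_tag:
            inside = False
            output.append(stripped + "\n")
        elif inside:
            # Leading tabs/spaces are dropped here (an assumption, see above)
            output.append(stripped + "\n")
    output.append(f"</{root}>\n")
    return output


# Small in-memory example instead of a real file
sample = ["line\n", "<tag>\n", "\tinfo\n", "</tag>\n", "line\n"]
print("".join(extract_blocks(sample)), end="")
```

To use it on the real file, pass the open file object (or file.readlines()) as lines and write the returned list to output.xml with writelines(). Note that this sketch does not escape characters such as & or < inside the info lines.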
CodePudding user response:
Your version using re is along the right lines. You could instead do:
    import re

    findtag = "tag"
    pattern = rf"<{findtag}>(.*?)</{findtag}>"  # make pattern

    # get input
    with open("document.txt", "r") as fp:
        data = fp.read()  # read in all the data to a string

    # DOTALL lets . match newlines, so a match can span multiple lines
    results = re.findall(pattern, data, flags=re.DOTALL)

    # print out results (you could write it to a file instead)
    for res in results:
        print(f"<{findtag}>")
        for item in res.strip().split("\n"):
            print(item)
        print(f"</{findtag}>")
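For reference, the attempt in the question returned an empty list for two reasons: str(test) is the text representation of the file object, not the file's contents, and without re.DOTALL the . does not match newlines. Here is a rough sketch of the regex approach as a function that returns the text rather than printing it. The name extract_with_regex is hypothetical, and both the re.escape call and the stripping of each line are added assumptions.

```python
import re


def extract_with_regex(text, tag="tag"):
    """Return every <tag>...</tag> block found in text, tags included."""
    # re.escape guards against tag names containing regex metacharacters
    name = re.escape(tag)
    pattern = rf"<{name}>(.*?)</{name}>"
    blocks = []
    for body in re.findall(pattern, text, flags=re.DOTALL):
        # Strip surrounding whitespace (e.g. tabs) from each info line
        lines = [item.strip() for item in body.strip().split("\n")]
        blocks.append("\n".join([f"<{tag}>", *lines, f"</{tag}>"]))
    return "\n".join(blocks) + "\n" if blocks else ""


# Small in-memory example instead of a real file
sample = "line\n<tag>\n\tinfo\n\tinfo\n</tag>\nline\n"
print(extract_with_regex(sample), end="")
```

To write the result to a file, read document.txt into a string with fp.read(), pass it to the function, and write the returned string to output.xml.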