Python: how to print lines from a text file that appear between two tags?

I've got a txt file that essentially reads like this:

line
line
line
<tag>
   info
   info
   info
</tag>
<tag>
   info
   info
   info
</tag>
line
line

I want to rewrite the file so that it keeps only the info lines (including the tags, which are the same in both instances) and drops the other lines. After this I'll export it as XML and load it into Excel as a table.

I've tried two variations so far, with no luck:

import re

with open('document.txt') as test:    
    for line in test:
        target = "<tag>(.*?)</tag>"
        res = re.findall(target, str(test))
        test.write(str(res))

This seems to just return an empty list and prints [] at the end of my document.

with open('document.txt') as test:
    parsing = False
    for line in test:
        with open('document.txt') as test:
            if line.startswith("<tag>"):
                parsing = True
            elif line.startswith("</tag>"):
                parsing = False
            if parsing==True:
                test.write(line)

This just messes up my document and places various text/tags in weird places.

e.g. I started with

i
<tag>j</tag>
k
<tag>l</tag>
m

as a test, and ended up with

mtag>l</tag>
>
k
<tag>l</tag>
m

I'm pretty new to Python (if you couldn't tell) so apologies if there's a pretty easy fix to this.

Thanks in advance.

CodePudding user response:

You could do it like this:

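with open('document.txt', 'r') as file:
    lines = file.readlines()

output = []
inside_tag = False  # True while we are between <tag> and </tag>

for line in lines:
    if line.strip() == '<tag>':
        # opening tag: start keeping lines, and keep the tag itself
        inside_tag = True
        output.append(line)
    elif line.strip() == '</tag>':
        # closing tag: keep it, then stop keeping lines
        inside_tag = False
        output.append(line)
    elif inside_tag:
        output.append(line)

# write to a new file so document.txt is left untouched
with open('output.xml', 'w') as file:
    file.writelines(output)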

The output.xml will contain the following:

<tag>
   info
   info
   info
</tag>
<tag>
   info
   info
   info
</tag>

If you want to remove the indentation before the info lines, you can simply use output.append(line.strip() + '\n') instead of output.append(line).
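
If it helps, the same loop can be wrapped in a small function so you can try it on a string before touching any files. This is only a sketch: the name extract_tag_blocks and the strip_indent parameter are made up for illustration, and it assumes each opening and closing tag sits on its own line, as in your sample.

```python
def extract_tag_blocks(lines, tag="tag", strip_indent=False):
    """Return only the lines from <tag> through </tag>, inclusive."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    output = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped == open_tag:
            inside = True
        if inside:
            # optionally drop the leading indentation, keeping one newline
            output.append(stripped + "\n" if strip_indent else line)
        if stripped == close_tag:
            inside = False
    return output


sample = "line\n<tag>\n   info\n</tag>\nline\n".splitlines(keepends=True)
kept = extract_tag_blocks(sample, strip_indent=True)
print("".join(kept), end="")
```

Note that a line like <tag>j</tag>, with both tags on one line, would not be picked up by this line-based approach.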

CodePudding user response:

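Your version using re was on the right track. The problem is that it runs the regex against str(test), which is the text representation of the file object rather than the contents of the file, so nothing matches. Read the file into a string first, and pass re.DOTALL so that . also matches newlines. You could instead do: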

import re

findtag = "tag"
pattern = rf"<{findtag}>(.*?)</{findtag}>"   # make pattern

# get input
with open("document.txt", "r") as fp:
    data = fp.read()  # read in all the data to a string

results = re.findall(pattern, data, flags=re.DOTALL)  # DOTALL finds over multiple lines

# print out results (you could write it to a file instead) 
for res in results:
    print(f"<{findtag}>")
    for item in res.strip().split("\n"):
        print(item)
    print(f"</{findtag}>")
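
If you would rather write the result to a file than print it, something along these lines might work. It is a sketch, not a drop-in solution: extract_blocks is a hypothetical helper name, output.xml is just an assumed output file name, and the pattern here matches the whole block (tags included) instead of capturing only the inside.

```python
import re


def extract_blocks(text, tag="tag"):
    """Return every <tag>...</tag> block in text, one block after another."""
    name = re.escape(tag)
    # No capture group, so findall returns each whole match, tags included.
    pattern = rf"<{name}>.*?</{name}>"
    blocks = re.findall(pattern, text, flags=re.DOTALL)
    return "".join(block + "\n" for block in blocks)


sample = "i\n<tag>j</tag>\nk\n<tag>\n   l\n</tag>\nm\n"
result = extract_blocks(sample)
print(result, end="")

# To save instead of printing (input file name as in the question):
# with open("document.txt") as src, open("output.xml", "w") as dst:
#     dst.write(extract_blocks(src.read()))
```

Because the regex works on the whole text rather than line by line, it also handles tags that open and close on the same line, like the <tag>j</tag> test case in the question.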