filter non-nested tag values from XML

Time:01-23

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>

I am parsing it using BeautifulSoup. Upon doing this:

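from bs4 import BeautifulSoup

# `file` is assumed to be the opened XML file (or its contents as a string)
soup = BeautifulSoup(file, 'xml')

pos = []
for i in soup.find_all('pos'):
    pos.append(i.text)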

I get a list of all pos tag values, even the ones nested inside the kat_pis tag.

So I get (697, 112, 099, 113).

However, I only want the pos values of the non-nested tags, i.e. the ones that are direct children of offer.

The expected result is (697, 099).

How can I achieve this?

CodePudding user response:

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are direct children of offer elements:

from lxml import etree

with open('data.xml') as fd:
    doc = etree.parse(fd)

pos = []
for ele in (doc.xpath('//offer/pos')):
    pos.append(ele.text)

print(pos)

Given your example input, the above code prints:

['697', '099']
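
If adding lxml as a dependency is not an option, the same child-only selection can probably be done with the standard library. The sketch below assumes that the limited XPath subset in xml.etree.ElementTree is enough for this case; the helper name direct_pos_values is made up for illustration, and the sample data is copied from the question.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, kept as bytes so the
# encoding declaration is handled by the parser.
XML_DOC = b'''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>'''


def direct_pos_values(xml_bytes):
    """Return the text of every pos element that is a direct child of an offer.

    Hypothetical helper, not part of any library.
    """
    root = ET.fromstring(xml_bytes)
    # './/offer' matches offer elements at any depth below the root;
    # the trailing '/pos' step only matches direct children, so the
    # pos elements inside kat_pis are skipped.
    return [pos.text for pos in root.findall('.//offer/pos')]


print(direct_pos_values(XML_DOC))
```

The values come back as strings, so the leading zero in 099 is preserved.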

CodePudding user response:

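Here is one way of getting those first-level pos values while staying with BeautifulSoup. The CSS child combinator (>) in the selector matches only pos elements that are direct children of offer: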

from bs4 import BeautifulSoup as bs

xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>'''

soup = bs(xml_doc, 'xml')

pos = []
for i in (soup.select('offer > pos')):
    pos.append(i.text)

print(pos)

Result in terminal:

['697', '099']