How can I get the href attribute of *any* element in an XML document (including deeply nested ones) using xpat

Time:11-18

[Python] I'm trying to retrieve any element in an XML document that has an href attribute, at any level of the XML document. For example:

<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>

This should retrieve the following elements (as lxml element objects, simplified for visual clarity):

[<OuterElement href='a.com'>, <NestedInner href='b.com' />, <NestedInner href='c.com' />, <InnerElement href='d.com'/>]

I've tried using the following code to retrieve any element with an href attribute, but it retrieves zero elements from a file full of elements with href attributes:

from lxml import etree

with open(file, 'rb') as f:
    xml_tree = etree.parse(f)
    href_elements = xml_tree.xpath(".//*[@href]")

Shouldn't this code select any element (.//*) with the specified attribute ([@href])? From my understanding (definitely correct me if I am wrong, I most likely am), href_elements should be an array of lxml element objects that each have an href attribute.

Important clarification: I have seen many people asking about XPath on Stack Overflow, but I have yet to find a solved question about how to search through all elements in an XML document and retrieve every element that meets a criterion (such as having an href attribute).

CodePudding user response:

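Here is a solution based on ElementTree from the standard library. The root is checked separately because './/*' matches only the descendants of the root, not the root itself: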

import xml.etree.ElementTree as ET

xml = '''<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>'''

root = ET.fromstring(xml)
elements_with_href = [root] if 'href' in root.attrib else []
elements_with_href.extend(root.findall('.//*[@href]'))
for e in elements_with_href:
    print(f'{e.tag} : {e.attrib["href"]}')

Output:

OuterElement : a.com
NestedInner : b.com
NestedInner : c.com
InnerElement : d.com
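
A possible variant, as a minimal sketch: Element.iter() yields the element it is called on before its descendants, so the separate check on the root is not needed. The helper names elements_with_attr and elements_with_local_attr are made up for this illustration. The second helper covers one assumption that the question does not confirm: if the real file uses namespaced attributes (for example xlink:href), ElementTree stores the key as '{namespace-uri}href', and a plain 'href' test would match nothing.

```python
import xml.etree.ElementTree as ET


def elements_with_attr(root, name):
    """Return root and every descendant that has attribute `name`, in document order."""
    # iter() starts with `root` itself, then walks all descendants.
    return [e for e in root.iter() if name in e.attrib]


def elements_with_local_attr(root, local_name):
    """Like elements_with_attr, but ignore any namespace on the attribute name."""
    # Namespaced keys look like '{uri}href'; rpartition('}')[2] keeps only 'href'.
    return [
        e for e in root.iter()
        if any(key.rpartition('}')[2] == local_name for key in e.attrib)
    ]


xml = '''<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>'''

root = ET.fromstring(xml)
for e in elements_with_attr(root, 'href'):
    print(f'{e.tag} : {e.attrib["href"]}')
```

For the sample document this prints the same four lines as the output above. If it returns nothing on the real file, trying elements_with_local_attr(root, 'href') is a quick way to check whether the attributes are namespaced.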