[Python] I'm trying to retrieve any element in an XML document that has an href
attribute, at any level of the XML document. For example:
<OuterElement href='a.com'>
    <InnerElement>
        <NestedInner href='b.com' />
        <NestedInner href='c.com' />
        <NestedInner />
    </InnerElement>
    <InnerElement href='d.com'/>
</OuterElement>
This would retrieve the following elements (as lxml element objects, simplified for visual clarity):
[<OuterElement href='a.com'>, <NestedInner href='b.com' />, <NestedInner href='c.com' />, <InnerElement href='d.com'/>]
I've tried using the following code to retrieve any element with an href attribute, but it retrieves zero elements on a file full of elements with href attributes:
from lxml import etree

with open(file, 'rb') as f:
    xml_tree = etree.parse(f)
href_elements = xml_tree.xpath(".//*[@href]")
Shouldn't this code select any element (.//*) with the specified attribute ([@href])? From my understanding (definitely correct me if I am wrong, I most likely am), href_elements should be an array of lxml element objects that each have an href attribute.
Important clarification: I have seen many people asking about XPath on Stack Overflow, but I have yet to find a solved question about how to search through all elements in an XML document and retrieve every element that fits a criterion (such as having an href attribute).
CodePudding user response:
Based on ElementTree
import xml.etree.ElementTree as ET
xml = '''<OuterElement href='a.com'>
<InnerElement>
<NestedInner href='b.com' />
<NestedInner href='c.com' />
<NestedInner />
</InnerElement>
<InnerElement href='d.com'/>
</OuterElement>'''
root = ET.fromstring(xml)
# findall('.//*[@href]') only matches descendants, so check the root element separately
elements_with_href = [root] if 'href' in root.attrib else []
elements_with_href.extend(root.findall('.//*[@href]'))
for e in elements_with_href:
    print(f'{e.tag} : {e.attrib["href"]}')
Output:
OuterElement : a.com
NestedInner : b.com
NestedInner : c.com
InnerElement : d.com
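If the same query still returns nothing on the real file, one possible cause (an assumption, since the actual file isn't shown) is that the attribute is namespaced, e.g. xlink:href, which a plain [@href] test does not match. Below is a minimal sketch of an alternative that walks the tree with iter(), which includes the root, and compares only the local part of each attribute name. The helper names local_name and elements_with_attr and the namespaced sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(attr_key):
    # ElementTree stores namespaced names as '{uri}local'; keep only 'local'.
    return attr_key.rsplit('}', 1)[-1]


def elements_with_attr(root, name):
    # root.iter() yields the root itself and every descendant in document order.
    return [e for e in root.iter()
            if any(local_name(k) == name for k in e.attrib)]


# The sample document from the question (no namespaces).
xml = '''<OuterElement href='a.com'>
<InnerElement>
<NestedInner href='b.com' />
<NestedInner href='c.com' />
<NestedInner />
</InnerElement>
<InnerElement href='d.com'/>
</OuterElement>'''
root = ET.fromstring(xml)
found = elements_with_attr(root, 'href')

# Hypothetical namespaced document, only to illustrate the assumed failure mode.
ns_xml = '''<doc xmlns:xlink='http://www.w3.org/1999/xlink'>
<link xlink:href='e.com'/>
<link/>
</doc>'''
ns_root = ET.fromstring(ns_xml)
plain_match = ns_root.findall('.//*[@href]')       # un-namespaced test
any_ns_match = elements_with_attr(ns_root, 'href')  # ignores the namespace

for e in found:
    print(e.tag, list(e.attrib.values()))
print(len(plain_match), len(any_ns_match))
```

On the question's sample this should give the same four elements as the answer above, without the special case for the root. Matching on the local name alone ignores which namespace the attribute belongs to, so it is better suited to diagnosing the problem than to strict production use.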