Given the following XML file:
<Data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<HeaderCode>My header</HeaderCode>
</Header>
<Customers>
<Customer>
<Code>4ESV1VPTKNW</Code>
<Additional>
<Info>PL</Info>
</Additional>
</Customer>
<Customer>
<Code>GNOAFMAPJIG</Code>
<Additional>
<Info>BL</Info>
</Additional>
</Customer>
</Customers>
<Trailer>
<FileCreationDate>20200716</FileCreationDate>
<RecordCount>10</RecordCount>
</Trailer>
</Data>
How can I get all unique nodes which don't have child elements?
/Data/Header/HeaderCode
/Data/Customers/Customer/Code
/Data/Customers/Customer/Additional/Info
/Data/Trailer/FileCreationDate
/Data/Trailer/RecordCount
My current code looks like this:
from lxml import etree

tree = etree.parse('my_file.xml')
for node in tree.xpath('//*'):
    if not node.getchildren():
        print(tree.getpath(node))
Are there any built-in methods or XPath expressions for this purpose?
CodePudding user response:
How can I get all unique nodes which don't have child elements?
A recursive function does the job.
Key points of the solution:
len(list(element)) == 0 means the node has no child elements.
path_holder holds the current location in the tree.
Code:
import xml.etree.ElementTree as ET
xml = '''<Data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<HeaderCode>My header</HeaderCode>
</Header>
<Customers>
<Customer>
<Code>4ESV1VPTKNW</Code>
<Additional>
<Info>PL</Info>
</Additional>
</Customer>
<Customer>
<Code>GNOAFMAPJIG</Code>
<Additional>
<Info>BL</Info>
</Additional>
</Customer>
</Customers>
<Trailer>
<FileCreationDate>20200716</FileCreationDate>
<RecordCount>10</RecordCount>
</Trailer>
</Data>'''
terminal_nodes = set()
root = ET.fromstring(xml)
path = []

def collect_terminal_nodes(element, holder, path_holder):
    if len(list(element)) == 0:
        # Leaf: record the full path, then step back up
        path_holder.append(element.tag)
        holder.add('/'.join(path_holder))
        path_holder.pop()
    else:
        # Inner node: descend into every child
        path_holder.append(element.tag)
        for e in list(element):
            collect_terminal_nodes(e, holder, path_holder)
        path_holder.pop()

collect_terminal_nodes(root, terminal_nodes, path)
for idx, node in enumerate(terminal_nodes, 1):
    print(f'{idx}) {node}')
Output (a set is unordered, so the order may differ between runs):
1) Data/Trailer/RecordCount
2) Data/Customers/Customer/Additional/Info
3) Data/Header/HeaderCode
4) Data/Customers/Customer/Code
5) Data/Trailer/FileCreationDate
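Because a set has no defined order, the numbering above can change between runs. If document order matters, as in the expected output in the question, one option is to collect the paths as dictionary keys instead. The following is a minimal sketch, not part of the original answer; the helper name ordered_leaf_paths and the shortened sample XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the XML from the question
SAMPLE = """<Data>
  <Header><HeaderCode>My header</HeaderCode></Header>
  <Customers>
    <Customer><Code>A</Code><Additional><Info>PL</Info></Additional></Customer>
    <Customer><Code>B</Code><Additional><Info>BL</Info></Additional></Customer>
  </Customers>
  <Trailer><FileCreationDate>20200716</FileCreationDate><RecordCount>10</RecordCount></Trailer>
</Data>"""


def ordered_leaf_paths(root):
    """Return unique paths of childless elements in document order."""
    seen = {}  # dict keys keep insertion order (Python 3.7+)

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        if len(element) == 0:  # no child elements: a leaf
            seen.setdefault(path, None)
        for child in element:
            walk(child, path)

    walk(root, "")
    return list(seen)


for p in ordered_leaf_paths(ET.fromstring(SAMPLE)):
    print(p)
```

The paths come out with a leading slash and in the order the elements first appear, which matches the layout asked for in the question.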
CodePudding user response:
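The easiest way to test whether a node has child nodes is a truth value test: bool(node) is False if there are no child nodes. Note that truth-testing an Element is deprecated since Python 3.12; len(node) == 0 is the explicit equivalent. So we can write a simple recursive generator function which iterates over a node and yields the absolute path of each descendant that has no children: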
def iter_childless(node, path=""):
    for sub_node in node:
        next_path = (path or node.tag) + "/" + sub_node.tag
        if len(sub_node):  # has children (explicit form of "if sub_node")
            yield from iter_childless(sub_node, next_path)
        else:
            yield next_path
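The "has no children" check can be tried in isolation. This is a small sketch with a made-up two-element document (the tag names are illustrative only); it uses len(), which returns the number of direct child elements and avoids truth-testing the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: <parent> has one child, <leaf> has none.
parent = ET.fromstring("<parent><leaf>text</leaf></parent>")
leaf = parent[0]

# len() counts direct child elements only; text content does not count.
print(len(parent))  # 1
print(len(leaf))    # 0

is_leaf = len(leaf) == 0
print(is_leaf)      # True
```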
The easiest way to filter possible duplicates is to save the results of iter_childless() into a set:
import xml.etree.ElementTree as ET
source = '''<Data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<HeaderCode>My header</HeaderCode>
</Header>
<Customers>
<Customer>
<Code>4ESV1VPTKNW</Code>
<Additional>
<Info>PL</Info>
</Additional>
</Customer>
<Customer>
<Code>GNOAFMAPJIG</Code>
<Additional>
<Info>BL</Info>
</Additional>
</Customer>
</Customers>
<Trailer>
<FileCreationDate>20200716</FileCreationDate>
<RecordCount>10</RecordCount>
</Trailer>
</Data>'''
...
root = ET.fromstring(source)
childless_nodes = set(iter_childless(root))
print(*childless_nodes, sep="\n")
CodePudding user response:
Here is a way to do it with lxml that makes use of the getpath() function. A regular expression is used to remove positional predicates ([1], [2]) from the paths.
from lxml import etree
import re

re_predicate = re.compile(r'\[\d+\]')

tree = etree.parse('my_file.xml')
paths = []
for node in tree.xpath('//*'):
    if not list(node):  # getchildren() is deprecated
        p = tree.getpath(node)
        p = re_predicate.sub('', p)  # Remove [1], [2] etc. from the path
        paths.append(p)

for p in set(paths):  # Get unique paths
    print(p)
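The predicate-stripping step can be checked without lxml. The sketch below runs the same kind of regular expression over a few hand-written paths in the style that getpath() returns; the sample paths are assumptions for illustration, not output captured from the file.

```python
import re

# Matches positional predicates such as [1] or [12].
re_predicate = re.compile(r'\[\d+\]')

# Hand-written sample paths in the style of getpath(); no lxml needed here.
sample_paths = [
    '/Data/Header/HeaderCode',
    '/Data/Customers/Customer[1]/Code',
    '/Data/Customers/Customer[2]/Code',
    '/Data/Customers/Customer[2]/Additional/Info',
]

stripped = [re_predicate.sub('', p) for p in sample_paths]
unique = list(dict.fromkeys(stripped))  # de-duplicate, keep first-seen order

for p in unique:
    print(p)
```

Using dict.fromkeys() instead of set() keeps the paths in first-seen order. On the question of a built-in XPath: the XPath 1.0 expression //*[not(*)] selects elements that have no child elements, so passing it to tree.xpath() should make the separate "if not list(node)" check unnecessary.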