How to get all unique XML nodes which don't have child elements?

Time:10-02

Using XML file

<Data
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header>
        <HeaderCode>My header</HeaderCode>
    </Header>
    <Customers>
        <Customer>
            <Code>4ESV1VPTKNW</Code>
            <Additional>
                <Info>PL</Info>
            </Additional>
        </Customer>
        <Customer>
            <Code>GNOAFMAPJIG</Code>
            <Additional>
                <Info>BL</Info>
            </Additional>
        </Customer>
    </Customers>
    <Trailer>
        <FileCreationDate>20200716</FileCreationDate>
        <RecordCount>10</RecordCount>
    </Trailer>
</Data>

How can I get all unique nodes which don't have child elements?

/Data/Header/HeaderCode
/Data/Customers/Customer/Code
/Data/Customers/Customer/Additional/Info
/Data/Trailer/FileCreationDate
/Data/Trailer/RecordCount

My current code looks like this:

from lxml import etree

tree = etree.parse(open('my_file.xml'))
for node in tree.xpath('//*'):
    if not node.getchildren():
        print(tree.getpath(node))

Are there any built-in methods or XPath expressions for this purpose?

CodePudding user response:

How can I get all unique nodes which don't have child elements?

A recursive function does the job.

Key points of the solution:

  • len(list(element)) == 0 means the node has no child elements
  • path_holder holds the current location in the tree

code:

import xml.etree.ElementTree as ET

xml = '''<Data
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header>
        <HeaderCode>My header</HeaderCode>
    </Header>
    <Customers>
        <Customer>
            <Code>4ESV1VPTKNW</Code>
            <Additional>
                <Info>PL</Info>
            </Additional>
        </Customer>
        <Customer>
            <Code>GNOAFMAPJIG</Code>
            <Additional>
                <Info>BL</Info>
            </Additional>
        </Customer>
    </Customers>
    <Trailer>
        <FileCreationDate>20200716</FileCreationDate>
        <RecordCount>10</RecordCount>
    </Trailer>
</Data>'''

terminal_nodes = set()
root = ET.fromstring(xml)
path = []
def collect_terminal_nodes(element,holder,path_holder):
    if len(list(element)) == 0:
        path_holder.append(element.tag)
        holder.add('/'.join(path_holder))
        path_holder.pop()
    else:
        path_holder.append(element.tag)
        for e in list(element):
            collect_terminal_nodes(e,holder,path_holder)
        path_holder.pop()

collect_terminal_nodes(root,terminal_nodes,path)
for idx,node in enumerate(terminal_nodes,1):
    print(f'{idx}) {node}')

output

1) Data/Trailer/RecordCount
2) Data/Customers/Customer/Additional/Info
3) Data/Header/HeaderCode
4) Data/Customers/Customer/Code
5) Data/Trailer/FileCreationDate
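
The numbering above is not stable, because a set is unordered, and the paths lack the leading slash shown in the question. A minimal sketch of a variant that returns the set instead of mutating shared state and prints it sorted; leaf_paths and the shortened sample document are hypothetical, chosen only for illustration:

```python
import xml.etree.ElementTree as ET

def leaf_paths(element, prefix=""):
    """Return the set of tag paths of all elements without child elements."""
    path = f"{prefix}/{element.tag}"
    if len(element) == 0:          # no child elements: this is a leaf
        return {path}
    paths = set()
    for child in element:          # recurse into each child element
        paths |= leaf_paths(child, path)
    return paths

# Shortened, hypothetical sample (not the full document from the question)
sample = (
    "<Data><Header><HeaderCode>x</HeaderCode></Header>"
    "<Trailer><RecordCount>10</RecordCount></Trailer></Data>"
)
for p in sorted(leaf_paths(ET.fromstring(sample))):
    print(p)
```

Sorting gives a reproducible listing, at the cost of losing document order.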

CodePudding user response:

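The easiest way to test whether a node has child nodes is a truth-value test: bool(node) is False when there are no child nodes. (Since Python 3.12 this test emits a DeprecationWarning; len(node) is the explicit equivalent.) So we can write a simple recursive generator function that iterates over a node and yields the absolute path of every descendant that has no children: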

def iter_childless(node, path=""):
    for sub_node in node:
        next_path = (path or node.tag) + "/" + sub_node.tag
        if len(sub_node):  # has child elements; bool(sub_node) is deprecated since Python 3.12
            yield from iter_childless(sub_node, next_path)
        else:
            yield next_path

The easiest way to filter out duplicates is to collect the results of iter_childless() in a set:

import xml.etree.ElementTree as ET

source = '''<Data
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header>
        <HeaderCode>My header</HeaderCode>
    </Header>
    <Customers>
        <Customer>
            <Code>4ESV1VPTKNW</Code>
            <Additional>
                <Info>PL</Info>
            </Additional>
        </Customer>
        <Customer>
            <Code>GNOAFMAPJIG</Code>
            <Additional>
                <Info>BL</Info>
            </Additional>
        </Customer>
    </Customers>
    <Trailer>
        <FileCreationDate>20200716</FileCreationDate>
        <RecordCount>10</RecordCount>
    </Trailer>
</Data>'''

...

root = ET.fromstring(source)
childless_nodes = set(iter_childless(root))
print(*childless_nodes, sep="\n")
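
A set removes duplicates but also discards the order in which the paths were found. If document order matters, one option is dict.fromkeys(), which keeps the first occurrence of each key. A minimal self-contained sketch; iter_leaf_paths and the shortened sample are hypothetical stand-ins for the generator and document above:

```python
import xml.etree.ElementTree as ET

def iter_leaf_paths(node, path=""):
    """Yield the tag path of every descendant that has no child elements."""
    for sub_node in node:
        next_path = (path or node.tag) + "/" + sub_node.tag
        if len(sub_node):              # has child elements: descend
            yield from iter_leaf_paths(sub_node, next_path)
        else:                          # leaf: report its path
            yield next_path

# Shortened, hypothetical sample with two structurally identical customers
doc = ET.fromstring(
    "<Data><Customers>"
    "<Customer><Code>A</Code><Additional><Info>PL</Info></Additional></Customer>"
    "<Customer><Code>B</Code><Additional><Info>BL</Info></Additional></Customer>"
    "</Customers></Data>"
)

# dict keys are unique and keep insertion order, so duplicates are dropped
# while the first-seen order is preserved
ordered_unique = list(dict.fromkeys(iter_leaf_paths(doc)))
print(*ordered_unique, sep="\n")
```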

CodePudding user response:

Here is a way to do it with lxml that makes use of the getpath() function. A regular expression is used to remove positional predicates ([1], [2]) from the paths.

from lxml import etree
import re
 
re_predicate = re.compile(r'\[\d+\]')
 
tree = etree.parse('my_file.xml')
 
paths = []
 
for node in tree.xpath('//*'):
    if not list(node):               # getchildren() is deprecated
        p = tree.getpath(node)
        p = re_predicate.sub('', p)  # Remove [1], [2] etc. from the path
        paths.append(p)
 
for p in set(paths):                 # Get unique paths
    print(p)
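
To see what the predicate stripping does in isolation, here is a small sketch that needs only the standard library. The raw_paths strings are hand-written assumptions imitating the kind of output getpath() produces for repeated sibling elements; they are not generated by lxml here:

```python
import re

# Matches positional predicates such as [1] or [12]
re_predicate = re.compile(r"\[\d+\]")

# Hypothetical getpath()-style paths for two <Customer> siblings
raw_paths = [
    "/Data/Customers/Customer[1]/Code",
    "/Data/Customers/Customer[2]/Code",
    "/Data/Customers/Customer[2]/Additional/Info",
]

# Strip the predicates, then deduplicate with a set comprehension
unique_paths = sorted({re_predicate.sub("", p) for p in raw_paths})
print(unique_paths)
```

Both Customer[1] and Customer[2] collapse to the same Customer step, which is what makes the paths comparable across repeated elements.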