Iterate on XML tags and get elements' xpath in Python

Time:12-31

I want to iterate over every "p" tag in an XML document and get the current element's XPath, but I can't find anything that does it.

The kind of code I tried:

from bs4 import BeautifulSoup

xml_file = open("./data.xml", "rb")
soup = BeautifulSoup(xml_file, "lxml")

for i in soup.find_all("p"):
    print(i.xpath) # xpath doesn't work here (None)
    print("\n")

Here is a sample XML file that I am trying to parse:

<?xml version="1.0" encoding="UTF-8"?>

<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>

I would like my code to output:

/article/body/p[0]
/article/body/p[1]

CodePudding user response:

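You can use lxml's getpath() to get the XPath of an element:

from lxml import etree

tree = etree.parse("./data.xml")
result = tree.xpath("//p")  # select every <p> element
for r in result:
    print(tree.getpath(r))  # e.g. /article/body/p[1]; XPath positions start at 1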
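
To make the getpath() approach concrete, here is a minimal, self-contained sketch that runs it against the sample document from the question. The helper name paths_for_tag and the inline SAMPLE constant are made up for illustration; only etree.fromstring(), getroottree(), iter() and getpath() come from lxml.

```python
from lxml import etree

# The sample document from the question, inlined as bytes so that the
# encoding declaration is accepted by etree.fromstring().
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>
"""


def paths_for_tag(xml_bytes, tag):
    """Return the XPath of every <tag> element, in document order.

    Hypothetical helper, not part of lxml.
    """
    root = etree.fromstring(xml_bytes)
    tree = root.getroottree()  # getpath() lives on the tree, not the element
    return [tree.getpath(elem) for elem in root.iter(tag)]


for path in paths_for_tag(SAMPLE, "p"):
    print(path)
# /article/body/p[1]
# /article/body/p[2]
```

Note that the indices are 1-based, because that is how XPath counts positions, so the output is p[1] and p[2] rather than the p[0] and p[1] asked for in the question.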

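If you also need the text of the matched elements, you can query the parsed document directly (this assumes etree is imported from lxml and xml holds the document as bytes):

doc = etree.fromstring(xml)
btags = doc.xpath('//p/b')  # every <b> directly inside a <p>
for b in btags:
    print(b.text)

For very large files, you can try this function, which frees memory while iterating: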

import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

# xml is the document as bytes, e.g. open("./data.xml", "rb").read()
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
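
One caveat when combining this with getpath(): fast_iter deletes siblings that have already been processed, so positional indices computed later may no longer match the original document. A rough sketch of one way around that is to track the position yourself while streaming. The example below is an assumption-laden illustration, not part of the answer above: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml (lxml's iterparse accepts the same events argument), and stream_paths is a made-up name. Unlike getpath(), it always writes an index, because a streaming parser cannot know whether more siblings with the same tag will follow.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>
"""


def stream_paths(xml_bytes, tag):
    """Yield (path, text) for each <tag> element while streaming.

    Hypothetical helper. Paths always carry a 1-based index per segment.
    """
    path = []               # one segment per currently open element
    counters = [Counter()]  # per open element: children seen so far, by tag
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            counters[-1][elem.tag] += 1
            path.append(f"{elem.tag}[{counters[-1][elem.tag]}]")
            counters.append(Counter())
        else:
            if elem.tag == tag:
                yield "/" + "/".join(path), "".join(elem.itertext())
                # Only matched elements are cleared, so their children are
                # still intact when the text is read above.
                elem.clear()
            path.pop()
            counters.pop()


for xpath, text in stream_paths(SAMPLE, "p"):
    print(xpath, "->", text)
```

Clearing only the matched elements keeps the sketch short. It does not handle a matched tag nested inside another element with the same tag, since the inner one would be cleared before the outer one is read.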

For more details, see https://newbedev.com/efficient-way-to-iterate-through-xml-elements
