How to extract data from XML when some of the child tags and structure are unknown?

The XML looks like this:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"></URLLink></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>

I want to extract all the data inside each "ContainerBlockElement" as text, but the child tags and structure are different every time.

Expected output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla

Is hard-coding every path the only way? I can't even think of a way to hard-code them all.

for content in tree.findall(
                    ".//Section/ContainerBlockElement/UnorderedList/ListItem/Paragraph"):
    print(content)
for content in tree.findall(".//Section/ContainerBlockElement/Paragraph"):
    print(content)
etc...

CodePudding user response：

First, your example contains an error in the URLLink element: the opening tag is never closed.

<URLLink LinkURL="www.software1.com"</URLLink>

should be

<URLLink LinkURL="www.software1.com"/>

The full corrected example:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>

To extract the data, you can do something like this:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()
# Collect every parent whose direct children may carry text.
results = (root.findall('ContainerBlockElement/UnorderedList/ListItem')
           + root.findall('ContainerBlockElement')
           + root.findall('ContainerBlockElement/UnorderedList'))
for elem in results:
    for e in elem:
        text = (e.text or '').strip()
        if not text:
            continue  # skip children that hold only whitespace (or no text)
        url_link = e.find('./URLLink')
        if url_link is None:
            print(text)
        else:
            print(text + " " + url_link.attrib['LinkURL'])

Output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
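
The three hard-coded paths above cover this sample, but a new nesting would need a new path. As a rough sketch of a path-independent variant (not part of the original answer), the code below walks every descendant of each ContainerBlockElement with iter() and keeps any element that has text of its own. The inline SAMPLE string and the extract_lines helper are made up for illustration, and the sketch assumes that ContainerBlockElement nodes are direct children of the root and that no text follows a URLLink inside its parent.

```python
from xml.etree import ElementTree as ET

# Shortened inline copy of the corrected sample (illustrative only).
SAMPLE = """<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>
    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
        <UnorderedList>
            <ListItem>Don't do this</ListItem>
            <ListItem>Don't do that</ListItem>
        </UnorderedList>
    </ContainerBlockElement>
</Section>"""


def extract_lines(root):
    """Return one line per element that carries non-whitespace text."""
    lines = []
    # Assumption: blocks are direct children of the root element.
    for block in root.findall('ContainerBlockElement'):
        # iter() yields the block and all descendants in document order.
        for elem in block.iter():
            text = (elem.text or '').strip()
            if not text:
                continue  # pure container, or an empty element like URLLink
            # Append the LinkURL of each direct URLLink child, if any.
            urls = [link.get('LinkURL', '') for link in elem.findall('URLLink')]
            lines.append(' '.join([text] + urls))
    return lines


for line in extract_lines(ET.fromstring(SAMPLE)):
    print(line)
```

Because the loop never names UnorderedList, ListItem or Paragraph, it should keep working when those tags are nested differently. Text that appears after a child element (its tail) is ignored here.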

CodePudding user response：

In addition to the correction mentioned in @ACHRAF's answer, I would also suggest an alternative using lxml instead of ElementTree, because of lxml's better support for XPath:

from lxml import etree
doc = etree.parse('file.xml')
for entry in doc.xpath('//Paragraph'):
    link_target = entry.xpath('./URLLink/@LinkURL')
    ul_target = entry.xpath('./following-sibling::UnorderedList//text()')

    link = link_target[0] if link_target else ''
    ul = " ".join(ul_target) if ul_target  else ''

    print(entry.text,link,ul)

Output:

Download the software1 from:  www.software1.com 
Download the software2 from:  www.software2.com 
Apply the update in:  www.update.com 
Follow these rules:  
                 Don't do this 
                 Don't do that 
                 Don't do blablabla

CodePudding user response：

To get the elements that have actual text or a URLLink child, use this XPath:

/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]

The * matches any element node.

The predicate [URLLink or text()[normalize-space()]] keeps only elements that have a direct URLLink child element or a direct text() child containing more than just whitespace.

Then use Python to extract both the text() and the URLLink.
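
One way that last step might look, as a sketch rather than a definitive implementation: it assumes lxml is installed (ElementTree's limited XPath subset cannot evaluate this expression), and the inline SAMPLE string and the extract_lines helper are hypothetical names used only for illustration.

```python
from lxml import etree

# Shortened inline copy of the corrected sample (illustrative only).
SAMPLE = """<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>
    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
        <UnorderedList>
            <ListItem>Don't do this</ListItem>
            <ListItem>Don't do that</ListItem>
        </UnorderedList>
    </ContainerBlockElement>
</Section>"""

# The XPath from the answer: elements with a URLLink child or real text.
XPATH = '/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]'


def extract_lines(root):
    """Return one line per matching element: its own text, then its link URLs."""
    lines = []
    for elem in root.xpath(XPATH):
        # Direct text nodes only, with runs of whitespace collapsed.
        parts = [' '.join(t.split()) for t in elem.xpath('text()')]
        parts = [p for p in parts if p]  # drop whitespace-only nodes
        # URLs are appended after the text, not interleaved with it.
        parts += elem.xpath('URLLink/@LinkURL')
        lines.append(' '.join(parts))
    return lines


for line in extract_lines(etree.fromstring(SAMPLE)):
    print(line)
```

The XPath does the structural filtering, so the Python side only has to join each match's own text with any LinkURL attributes.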