How to extract data from XML when some of the child tags and structure are unknown?

Time:05-11

The XML looks like this:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"></URLLink></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>

I want to extract all the data inside each "ContainerBlockElement" as text, but the child tags and their structure are different every time.

Expected output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla

Is hard-coding every path the only way? I can't even think of a way to hard-code it:

for content in tree.findall(
                    ".//Section/ContainerBlockElement/UnorderedList/ListItem/Paragraph"):
    print(content)
for content in tree.findall(".//Section/ContainerBlockElement/Paragraph"):
    print(content)
etc...

CodePudding user response:

First, your example contains an error in the URLLink element. The opening tag is never closed:

<URLLink LinkURL="www.software1.com"</URLLink>

should be

<URLLink LinkURL="www.software1.com"/>
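
As a quick check of that correction, here is a minimal sketch that feeds both variants of the Paragraph line to the standard-library parser. The helper name is_well_formed is made up for this illustration; it only reports whether ElementTree accepts the fragment.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if ElementTree can parse the fragment, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# The line as posted in the question: the URLLink start tag is never closed.
broken = ('<Paragraph>Download the software1 from: '
          '<URLLink LinkURL="www.software1.com"</URLLink></Paragraph>')
# The corrected line, using a self-closing URLLink element.
fixed = ('<Paragraph>Download the software1 from: '
         '<URLLink LinkURL="www.software1.com"/></Paragraph>')

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
```

Any parser will reject the original file for the same reason, so the XML has to be fixed before any extraction approach can work.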

The full corrected example:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>

To extract the data, you can do something like this:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()
# Collect the parent elements whose direct children may carry text
results = (root.findall('ContainerBlockElement/UnorderedList/ListItem')
           + root.findall('ContainerBlockElement')
           + root.findall('ContainerBlockElement/UnorderedList'))
for elem in results:
    for e in elem:
        # e.text is None for an element with no text at all
        text = (e.text or '').strip()
        if not text:
            continue
        link = e.find('./URLLink')
        if link is None:
            print(text)
        else:
            print(text + " " + link.attrib['LinkURL'])

Output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla

CodePudding user response:

In addition to the correction mentioned in @ACHRAF's answer, I would also suggest an alternative using lxml instead of ElementTree, because of lxml's better support for XPath:

from lxml import etree
doc = etree.parse('file.xml')
for entry in doc.xpath('//Paragraph'):
    link_target = entry.xpath('./URLLink/@LinkURL')
    ul_target = entry.xpath('./following-sibling::UnorderedList//text()')

    link = link_target[0] if link_target else ''
    ul = " ".join(ul_target) if ul_target else ''

    print(entry.text, link, ul)

Output:

Download the software1 from:  www.software1.com 
Download the software2 from:  www.software2.com 
Apply the update in:  www.update.com 
Follow these rules:  
                 Don't do this 
                 Don't do that 
                 Don't do blablabla 

CodePudding user response:

To get the elements that have actual text or a URLLink child, use this XPath:

/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]

The * matches any element node.

The predicate [URLLink or text()[normalize-space()]] keeps only elements that have a direct URLLink child or a direct text() child containing more than just whitespace.

Then use Python to extract both the text() and the URLLink from each match.
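
The Python step could be sketched roughly as follows. This is only an approximation using the standard-library ElementTree, which does not support the predicate above, so the same filter is applied by hand while walking every descendant. The function name extract_lines is made up for this illustration, and the sketch assumes the text of interest sits in each element's leading text, so text that follows a child element (tail text) is ignored.

```python
import xml.etree.ElementTree as ET

# The corrected sample from the first answer, embedded so the sketch runs as-is.
XML = """<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>
    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
    </ContainerBlockElement>
    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
        <UnorderedList>
            <ListItem>Don't do this</ListItem>
            <ListItem>Don't do that</ListItem>
            <ListItem>Don't do blablabla</ListItem>
        </UnorderedList>
    </ContainerBlockElement>
</Section>"""


def extract_lines(root):
    """Return one line per element that has real text or a direct URLLink child."""
    lines = []
    for container in root.findall('ContainerBlockElement'):
        # iter() visits the container and all descendants in document order,
        # so the nesting below the container does not need to be known.
        for elem in container.iter():
            if elem.tag == 'URLLink':
                continue  # reported together with its parent element
            parts = []
            text = (elem.text or '').strip()
            if text:
                parts.append(text)
            for link in elem.findall('URLLink'):
                url = link.get('LinkURL')
                if url:
                    parts.append(url)
            if parts:
                lines.append(' '.join(parts))
    return lines


for line in extract_lines(ET.fromstring(XML)):
    print(line)
```

On the sample this prints the seven lines from the expected output in the question. With lxml, the XPath above could be passed to xpath() directly instead of filtering by hand.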
