My XML looks like this:
<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>
</ListItem>
</UnorderedList>
<UnorderedList>
<ListItem>
<Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"</URLLink></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"></URLLink></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
<ListItem>Don't do blablabla</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>
I want to extract all the data inside each "ContainerBlockElement"
as text, but the child tags and their structure are different every time.
Expected output:
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
Is hard-coding every path the only way? I can't even think of a way to hard-code it:
for content in tree.findall(
        ".//Section/ContainerBlockElement/UnorderedList/ListItem/Paragraph"):
    print(content)
for content in tree.findall(".//Section/ContainerBlockElement/Paragraph"):
    print(content)
etc...
CodePudding user response:
First, your example contains an error in the URLLink element:
<URLLink LinkURL="www.software1.com"</URLLink>
should be
<URLLink LinkURL="www.software1.com"/>
The full corrected example:
<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
</ListItem>
</UnorderedList>
<UnorderedList>
<ListItem>
<Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
<ListItem>Don't do blablabla</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>
To extract the data, you can do something like this:
from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()
# Collect every parent whose direct children may carry text.
results = (root.findall('ContainerBlockElement/UnorderedList/ListItem')
           + root.findall('ContainerBlockElement')
           + root.findall('ContainerBlockElement/UnorderedList'))
for elem in results:
    for e in elem:
        # Skip children that hold no text of their own (text may be None).
        text = (e.text or '').strip()
        if not text:
            continue
        url_link = e.find('./URLLink')
        if url_link is None:
            print(text)
        else:
            print(text + " " + url_link.attrib['LinkURL'])
Output:
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
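If the nesting really does change every time, the three fixed paths above can be replaced by a walk over all descendants. The following is only a sketch under some assumptions: the helper name container_lines and the inline SAMPLE string are made up for illustration, the URL is assumed to always be in a LinkURL attribute of a direct URLLink child, and any text that follows the link inside the same element is ignored.

```python
from xml.etree import ElementTree as ET

# Inline copy of the corrected sample, so the sketch runs without a file.
SAMPLE = """<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
</ListItem>
</UnorderedList>
<UnorderedList>
<ListItem>
<Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
<ListItem>Don't do blablabla</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>"""


def container_lines(root):
    """Hypothetical helper: one output line per element that has its own text."""
    lines = []
    for container in root.iter('ContainerBlockElement'):
        # iter() visits the container and all its descendants in document order.
        for elem in container.iter():
            text = (elem.text or '').strip()
            if not text:
                continue  # whitespace-only wrappers such as UnorderedList
            link = elem.find('URLLink')  # direct child only
            if link is not None:
                text += ' ' + link.get('LinkURL', '')
            lines.append(text)
    return lines


for line in container_lines(ET.fromstring(SAMPLE)):
    print(line)
```

On the corrected sample this prints the same seven lines as the output shown above, and it does not depend on how deeply Paragraph or ListItem are nested.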
CodePudding user response:
In addition to the correction mentioned in @ACHRAF's answer, I would also suggest an alternative using lxml instead of ElementTree, because of lxml's better support for XPath:
from lxml import etree

doc = etree.parse('file.xml')
for entry in doc.xpath('//Paragraph'):
    link_target = entry.xpath('./URLLink/@LinkURL')
    ul_target = entry.xpath('./following-sibling::UnorderedList//text()')
    link = link_target[0] if link_target else ''
    ul = " ".join(ul_target) if ul_target else ''
    print(entry.text, link, ul)
Output:
Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
CodePudding user response:
To select the elements that contain actual text or a URLLink child, use this XPath:
/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]
The * matches any element node.
The predicate [URLLink or text()[normalize-space()]] keeps only elements that have a direct URLLink child, or a direct text() child that contains more than just whitespace.
Then use Python to extract both the text() and the URLLink value from each match.
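The Python side could look roughly like the sketch below. It assumes lxml is installed (ElementTree's limited XPath does not support this predicate), that the corrected XML is used, and that the URL lives in the LinkURL attribute. The function name extract_lines and the shortened XML string are only for illustration.

```python
from lxml import etree

# Shortened version of the corrected sample, inlined so the sketch is runnable.
XML = """<Section>
<ContainerBlockElement>
<UnorderedList>
<ListItem>
<Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
</ListItem>
</UnorderedList>
</ContainerBlockElement>
<ContainerBlockElement>
<Paragraph>Follow these rules:</Paragraph>
<UnorderedList>
<ListItem>Don't do this</ListItem>
<ListItem>Don't do that</ListItem>
</UnorderedList>
</ContainerBlockElement>
</Section>"""

XPATH = "/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]"


def extract_lines(xml_text):
    """Hypothetical helper: one line per element matched by XPATH."""
    root = etree.fromstring(xml_text)
    lines = []
    for elem in root.xpath(XPATH):
        # text() returns only the element's direct text nodes.
        parts = [t.strip() for t in elem.xpath("text()") if t.strip()]
        # Append the link target(s) of any direct URLLink child.
        parts += elem.xpath("URLLink/@LinkURL")
        lines.append(" ".join(parts))
    return lines


for line in extract_lines(XML):
    print(line)
```

Because the predicate already filters out whitespace-only wrappers such as UnorderedList, the loop needs no extra checks, and the matches come back in document order.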