In the XML below, I need to extract the BinaryImage if the ImageType is fullImage.
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header />
<soapenv:Body>
<Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
<ImageArray>
<Image>
<InstanceID>5216</InstanceID>
<TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
<LaneID>0</LaneID>
<ImageType>fullImage</ImageType>
<ImageFormat>jpeg</ImageFormat>
<BinaryImage>abcd</BinaryImage>
</Image>
<Image>
<InstanceID>5216</InstanceID>
<TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
<LaneID>0</LaneID>
<ImageType>Patch</ImageType>
<ImageFormat>jpeg</ImageFormat>
<BinaryImage>abcd</BinaryImage>
</Image>
</ImageArray>
</Instation>
</soapenv:Body>
</soapenv:Envelope>
I tried findall and xpath, but they gave the following errors:
root.findall(".//{http://ffsf.us.com/schema_1-2}Image[contains(@ImageType,'fullImage')]")
root.xpath(".//{http://ffsf.us.com/schema_1-2}Image[contains(@ImageType,'fullImage')]")
root.xpath(".//{http://ffsf.us.com/schema_1-2}BinaryImage[contains(@ImageType,'fullImage')]")
root.xpath(".//{http://ffsf.us.com/schema_1-2}BinaryImage[@ImageType='fullImage']")
root.xpath(".//{http://ffsf.us.com/schema_1-2}Image[@ImageType='fullImage']")
lxml.etree.XPathEvalError: Invalid expression
SyntaxError: invalid predicate
The documentation does not seem to be very helpful. What am I doing wrong?
CodePudding user response:
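The following should work. It finds every Image element and keeps the BinaryImage text of those whose ImageType child is fullImage: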
import xml.etree.ElementTree as ET
xml = '''<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header />
<soapenv:Body>
<Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
<ImageArray>
<Image>
<InstanceID>5216</InstanceID>
<TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
<LaneID>0</LaneID>
<ImageType>fullImage</ImageType>
<ImageFormat>jpeg</ImageFormat>
<BinaryImage>abcd</BinaryImage>
</Image>
<Image>
<InstanceID>5216</InstanceID>
<TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
<LaneID>0</LaneID>
<ImageType>Patch</ImageType>
<ImageFormat>jpeg</ImageFormat>
<BinaryImage>abcd</BinaryImage>
</Image>
</ImageArray>
</Instation>
</soapenv:Body>
</soapenv:Envelope>'''
root = ET.fromstring(xml)
# Clark-notation prefix for the document's default namespace
ns = '{http://ffsf.us.com/schema_1-2}'
# Keep the BinaryImage text of each Image whose ImageType child is 'fullImage'
binary_images = [
    im.find(ns + 'BinaryImage').text
    for im in root.findall('.//' + ns + 'Image')
    if im.find(ns + 'ImageType').text == 'fullImage'
]
print(binary_images)
Output:
['abcd']
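The same filter can also be pushed into the path expression itself while staying within the standard library. The sketch below assumes ElementTree's limited XPath support for the [tag='text'] predicate and for a prefix-to-URI dictionary passed to findall. The prefix ns is an arbitrary choice, and the sample is a trimmed copy of the payload with the second BinaryImage changed to efgh so the effect of the filter is visible.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SOAP payload from the question (sample data only;
# the second BinaryImage value is changed to 'efgh' for illustration).
xml = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
      <ImageArray>
        <Image>
          <ImageType>fullImage</ImageType>
          <BinaryImage>abcd</BinaryImage>
        </Image>
        <Image>
          <ImageType>Patch</ImageType>
          <BinaryImage>efgh</BinaryImage>
        </Image>
      </ImageArray>
    </Instation>
  </soapenv:Body>
</soapenv:Envelope>"""

root = ET.fromstring(xml)

# "ns" is an arbitrary prefix chosen here; only the URI must match the document.
namespaces = {"ns": "http://ffsf.us.com/schema_1-2"}

# Select the BinaryImage child of every Image that has an ImageType child
# whose text is exactly 'fullImage'.
path = ".//ns:Image[ns:ImageType='fullImage']/ns:BinaryImage"
binary_images = [el.text for el in root.findall(path, namespaces)]
print(binary_images)
```

With this form the list comprehension only collects text; the selection logic lives in the path string, which makes it easier to reuse for other ImageType values.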
CodePudding user response:
Since you need to parse against a default namespace, consider using the namespaces argument available in both findall and xpath. It takes a dictionary that maps the namespace URI to a user-defined prefix (e.g., doc), which you then apply to every element in the XPath expression.
Additionally, drop the @ from your XPath: @ selects attributes, and ImageType is a child element, not an attribute.
import lxml.etree as lx
doc = lx.parse("Input.xml")
nmsp = {"doc": "http://ffsf.us.com/schema_1-2"}
xpr = ".//doc:Image[doc:ImageType='fullImage']/doc:BinaryImage"
images_findall = [d.text for d in doc.findall(xpr, namespaces=nmsp)]
print(images_findall)
['abcd']
images_xpath = [d.text for d in doc.xpath(xpr, namespaces=nmsp)]
print(images_xpath)
['abcd']
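To see why the @ had to go, it may help to compare the two predicates side by side. This is a small sketch that uses the standard library's ElementTree rather than lxml (an assumption made so it runs without extra installs). The inline sample and the variable names by_attribute and by_child are illustrative only.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the payload: ImageType is a child element, not an attribute.
xml = """<Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
  <ImageArray>
    <Image><ImageType>fullImage</ImageType><BinaryImage>abcd</BinaryImage></Image>
    <Image><ImageType>Patch</ImageType><BinaryImage>abcd</BinaryImage></Image>
  </ImageArray>
</Instation>"""

root = ET.fromstring(xml)
nmsp = {"doc": "http://ffsf.us.com/schema_1-2"}

# @ImageType looks for an attribute, as in <Image ImageType="fullImage">.
# This document never uses one, so nothing matches.
by_attribute = root.findall(".//doc:Image[@ImageType='fullImage']", nmsp)

# doc:ImageType tests the text of a child element, which is how the data is stored.
by_child = root.findall(".//doc:Image[doc:ImageType='fullImage']", nmsp)

print(len(by_attribute), len(by_child))
```

The attribute form is valid syntax but matches no elements here, while the child-element form matches the single fullImage entry.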
Do note: findall supports only a limited subset of XPath, such as the expression above, and not the fuller XPath 1.0 specification that xpath supports. For example, with xpath you could also have used the preceding-sibling axis:
xpr = ".//doc:Image/doc:BinaryImage[preceding-sibling::doc:ImageType='fullImage']"
images_xpath = [d.text for d in doc.xpath(xpr, namespaces=nmsp)]
print(images_xpath)
['abcd']