How to filter a child element by a condition on another child element in XML

In the XML below, I need to extract the BinaryImage of each Image whose ImageType is fullImage.

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
   <soapenv:Header />
   <soapenv:Body>
      <Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
         <ImageArray>
            <Image>
               <InstanceID>5216</InstanceID>
               <TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
               <LaneID>0</LaneID>
               <ImageType>fullImage</ImageType>
               <ImageFormat>jpeg</ImageFormat>
               <BinaryImage>abcd</BinaryImage>
            </Image>
            <Image>
               <InstanceID>5216</InstanceID>
               <TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
               <LaneID>0</LaneID>
               <ImageType>Patch</ImageType>
               <ImageFormat>jpeg</ImageFormat>
               <BinaryImage>abcd</BinaryImage>
            </Image>
         </ImageArray>
      </Instation>
   </soapenv:Body>
</soapenv:Envelope>

I tried findall and xpath with the expressions below, but they raised the following errors:

root.findall(".//{http://ffsf.us.com/schema_1-2}Image[contains(@ImageType,'fullImage')]")

root.xpath(".//{http://ffsf.us.com/schema_1-2}Image[contains(@ImageType,'fullImage')]")

root.xpath(".//{http://ffsf.us.com/schema_1-2}BinaryImage[contains(@ImageType,'fullImage')]")

root.xpath(".//{http://ffsf.us.com/schema_1-2}BinaryImage[@ImageType='fullImage']")

root.xpath(".//{http://ffsf.us.com/schema_1-2}Image[@ImageType='fullImage']")

lxml.etree.XPathEvalError: Invalid expression

SyntaxError: invalid predicate

The documentation does not seem to be very helpful, what am I doing wrong?

CodePudding user response：

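The following should work: find every Image element, then keep the BinaryImage text of those whose ImageType text is fullImage.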

import xml.etree.ElementTree as ET


xml = '''<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
   <soapenv:Header />
   <soapenv:Body>
      <Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
         <ImageArray>
            <Image>
               <InstanceID>5216</InstanceID>
               <TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
               <LaneID>0</LaneID>
               <ImageType>fullImage</ImageType>
               <ImageFormat>jpeg</ImageFormat>
               <BinaryImage>abcd</BinaryImage>
            </Image>
            <Image>
               <InstanceID>5216</InstanceID>
               <TimeStamp>2022-12-01T10:34:24.499Z</TimeStamp>
               <LaneID>0</LaneID>
               <ImageType>Patch</ImageType>
               <ImageFormat>jpeg</ImageFormat>
               <BinaryImage>abcd</BinaryImage>
            </Image>
         </ImageArray>
      </Instation>
   </soapenv:Body>
</soapenv:Envelope>'''


root = ET.fromstring(xml)
ns = '{http://ffsf.us.com/schema_1-2}'  # namespace in {uri} form, prepended to each tag
# Keep the BinaryImage text of every Image whose ImageType is 'fullImage'.
binary_images = [
    im.find(ns + 'BinaryImage').text
    for im in root.findall('.//' + ns + 'Image')
    if im.find(ns + 'ImageType').text == 'fullImage'
]
print(binary_images)

Output:

['abcd']
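
The same filter can also be pushed into the path expression itself. This is a minimal sketch, assuming the standard library's ElementTree and a trimmed copy of the XML from the question. The prefix doc and the names ns and path are arbitrary choices, and the second BinaryImage value is changed to efgh here only to make the filter's effect visible:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (SOAP envelope omitted for brevity).
# The second BinaryImage value is altered for illustration only.
xml = '''<Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
   <ImageArray>
      <Image>
         <ImageType>fullImage</ImageType>
         <BinaryImage>abcd</BinaryImage>
      </Image>
      <Image>
         <ImageType>Patch</ImageType>
         <BinaryImage>efgh</BinaryImage>
      </Image>
   </ImageArray>
</Instation>'''

root = ET.fromstring(xml)

# Map the default namespace URI to a prefix of our choosing ("doc" is arbitrary).
ns = {"doc": "http://ffsf.us.com/schema_1-2"}

# [doc:ImageType='fullImage'] keeps only Image elements that have a child
# ImageType whose text equals 'fullImage'.
path = ".//doc:Image[doc:ImageType='fullImage']/doc:BinaryImage"
binary_images = [el.text for el in root.findall(path, ns)]
print(binary_images)
```

The child-text predicate used here is one of the few predicate forms that ElementTree's findall accepts, so it avoids the explicit Python loop without needing lxml.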

CodePudding user response：

Since the elements you need are in a default namespace, consider using the namespaces argument, which both findall and xpath accept. It takes a dictionary that maps the namespace URI to a prefix of your choosing (e.g., doc), and you then put that prefix on every element name in the XPath expression.

Additionally, remove the @ from your XPath: @ selects attributes, but ImageType is a child element of Image, not an attribute.

import lxml.etree as lx

doc = lx.parse("Input.xml")

nmsp = {"doc": "http://ffsf.us.com/schema_1-2"}
xpr = ".//doc:Image[doc:ImageType='fullImage']/doc:BinaryImage"

images_findall = [d.text for d in doc.findall(xpr, namespaces=nmsp)]
print(images_findall)
# ['abcd']

images_xpath = [d.text for d in doc.xpath(xpr, namespaces=nmsp)]
print(images_xpath)
# ['abcd']
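
To make the attribute versus child element distinction concrete, here is a small sketch using the standard library's ElementTree rather than lxml, on a trimmed copy of the question's XML. The variable names are illustrative assumptions. SchemaVersion is a real attribute of Instation, so @ matches it, while ImageType is a child element of Image, so @ImageType matches nothing:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, keeping the envelope so that
# Instation is a descendant of the root.
xml = '''<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
   <soapenv:Body>
      <Instation xmlns="http://ffsf.us.com/schema_1-2" SchemaVersion="1.2">
         <ImageArray>
            <Image>
               <ImageType>fullImage</ImageType>
               <BinaryImage>abcd</BinaryImage>
            </Image>
         </ImageArray>
      </Instation>
   </soapenv:Body>
</soapenv:Envelope>'''

root = ET.fromstring(xml)
ns = {"doc": "http://ffsf.us.com/schema_1-2"}

# SchemaVersion="1.2" sits inside the start tag: it is an attribute, so @ applies.
with_attribute = root.findall(".//doc:Instation[@SchemaVersion='1.2']", ns)

# ImageType is NOT an attribute of Image, so this predicate matches nothing.
as_attribute = root.findall(".//doc:Image[@ImageType='fullImage']", ns)

# ImageType is a child element, so it is tested by name, without @.
as_child = root.findall(".//doc:Image[doc:ImageType='fullImage']", ns)

print(len(with_attribute), len(as_attribute), len(as_child))
```

Note that the @ImageType version does not raise an error; it silently returns an empty list, which is easy to mistake for "no matching data".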

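Do note: findall supports only a limited subset of XPath, such as the expression above, whereas xpath implements the full XPath 1.0 specification. That difference explains both of your errors: contains() is outside the findall subset (SyntaxError: invalid predicate), and the {uri}tag notation is not valid XPath, so xpath rejects it (XPathEvalError: Invalid expression). With xpath, for example, you could also have used the preceding-sibling axis: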

xpr = ".//doc:Image/doc:BinaryImage[preceding-sibling::doc:ImageType='fullImage']"

images_xpath = [d.text for d in doc.xpath(xpr, namespaces=nmsp)]
print(images_xpath)
# ['abcd']