lxml XPath returns empty list

I am trying to parse the XML below and read its content. The XML is provided as bytes. All of my XPath queries return an empty list.

<?xml version="1.0" encoding="UTF-8"?>
<Invoice
    xmlns="urn:eslog:2.00"
    xmlns:in="http://uri.etsi.org/01903/v1.1.1#"
    xmlns:io="http://www.w3.org/2000/09/xmldsig#"
    xmlns:xs4xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:eslog:2.00 eSLOG20_INVOIC_v200.xsd">
    <M_INVOIC Id="data">
        <S_UNH>
            <D_0062>1889</D_0062>
            <C_S009>
                <D_0065>INVOIC</D_0065>
                <D_0052>D</D_0052>
                <D_0054>01B</D_0054>
                <D_0051>UN</D_0051>
            </C_S009>
        </S_UNH>
        <S_BGM>
            <C_C002>
                <D_1001>380</D_1001>
            </C_C002>
            <C_C106>
                <D_1004>1889</D_1004>
            </C_C106>
        </S_BGM>
    </M_INVOIC>
</Invoice>

Below is the code I tried, with the output in the comments. etree.fromstring() seems to work, since it returns an element, but the XPath queries return an empty list. Your help is much appreciated.

def load(self, file):

        # print(file)  # b'<?xml version="1.0" ...
        # print(type(file))  # <class 'bytes'>

        root = etree.parse(BytesIO(file))
        # print(root.tag)  # Returns: object has no attribute 'tag'
        print(root.xpath('/Invoice'))  # Returns: []
        # print(root.nsmap)  # object has no attribute 'nsmap'

        root = etree.fromstring(file)
        print(root.tag)  # {urn:eslog:2.00}Invoice
        # print(root.xpath('/Invoice'))  # []
        # print(root.xpath('/{urn:eslog:2.00}Invoice')) # Invalid expression
        print(root.nsmap)  # {None: 'urn:eslog:2.00', 'in': ...
        print(root.xpath('/Invoce/M_INVOIC', nsmap = root.nsmap[None]))  # [] for all dict keys
        ns = { # a copy of root.nsmap
            'None': 'urn:eslog:2.00',
            'in': 'http://uri.etsi.org/01903/v1.1.1#',
            'io': 'http://www.w3.org/2000/09/xmldsig#',
            'xs4xs': 'http://www.w3.org/2001/XMLSchema',
            'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
        print(root.xpath('/Invoice/M_INVOIC/S_BGM/C_C106/D_1004', namespaces=ns)) # empty []
        print(root.xpath('/in:Invoice/M_INVOIC/S_BGM/C_C106/D_1004', namespaces=ns)) # empty []
        print(root.xpath('/io:Invoice/M_INVOIC/S_BGM/C_C106/D_1004', namespaces=ns)) # empty []
        print(root.xpath('/xs4xs:Invoice/M_INVOIC/S_BGM/C_C106/D_1004', namespaces=ns)) # empty []
        print(root.xpath('/xsi:Invoice/M_INVOIC/S_BGM/C_C106/D_1004', namespaces=ns)) # empty []

Answer:

The names of the elements in your XML document all belong to the namespace whose name is urn:eslog:2.00. That's because that namespace is declared as the "default" namespace (i.e. without a prefix), and none of the element names use a prefix.
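
As a quick way to see this from lxml itself, here is a minimal sketch. It assumes lxml is installed and uses a trimmed-down copy of your document that keeps only the default namespace declaration; the variable names are just for illustration.

```python
from lxml import etree

# Trimmed-down copy of the invoice; only the default namespace is kept.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:eslog:2.00">
    <M_INVOIC Id="data">
        <S_BGM>
            <C_C106>
                <D_1004>1889</D_1004>
            </C_C106>
        </S_BGM>
    </M_INVOIC>
</Invoice>"""

root = etree.fromstring(xml)

# lxml reports tags in "{namespace-uri}local-name" form.
qname = etree.QName(root)
print(root.tag)         # {urn:eslog:2.00}Invoice
print(qname.namespace)  # urn:eslog:2.00
print(qname.localname)  # Invoice

# Child elements without a prefix are in the default namespace as well.
child = root[0]
print(child.tag)        # {urn:eslog:2.00}M_INVOIC

# In XPath 1.0 an unprefixed name means "no namespace", so nothing matches.
unprefixed = root.xpath('/Invoice')
print(unprefixed)       # []
```

The last query is the same situation as in your code: the element is named Invoice in the urn:eslog:2.00 namespace, while the XPath asks for an Invoice in no namespace at all.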

Apart from the urn:eslog:2.00 namespace, there are 3 other namespace URIs bound to the prefixes in, io, and xs4xs, but those prefixes are not used anywhere in the document, so they are irrelevant.

The xsi prefix is only used in the xsi:schemaLocation attribute of the root element, but none of your XPath expressions refer to that attribute.

So in your ns namespace map that defines the binding of namespace prefixes and URIs, I think you only need to declare a single namespace (i.e. the urn:eslog:2.00 namespace), and then use the prefix you've associated with that URI in your XPath expressions.

Currently you have bound that namespace URI to the prefix None (the string 'None' in your ns dict), but your XPaths don't use that prefix on any of the element names. If you change your XPaths to use that prefix, they should work, e.g.

/None:Invoice/None:M_INVOIC/None:S_BGM/None:C_C106/None:D_1004

Of course a better prefix than None would be e.g. eslog, but the prefix you choose is arbitrary: it's the namespace URI that matters; your XPaths need to use a prefix that's bound to that URI.
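
Putting that together, a minimal sketch of a working query might look like the following. It assumes lxml is installed and uses a trimmed-down copy of your document; the prefixes eslog and x are my own arbitrary choices, not something the document defines.

```python
from lxml import etree

# Trimmed-down copy of the invoice; only the default namespace is kept.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:eslog:2.00">
    <M_INVOIC Id="data">
        <S_BGM>
            <C_C106>
                <D_1004>1889</D_1004>
            </C_C106>
        </S_BGM>
    </M_INVOIC>
</Invoice>"""

root = etree.fromstring(xml)

# Bind a prefix of our own choosing to the document's default namespace URI.
# root.nsmap[None] is the URI declared with xmlns="..." on the root element.
ns = {'eslog': root.nsmap[None]}

# Every step of the path needs the prefix, because every element is namespaced.
path = '/eslog:Invoice/eslog:M_INVOIC/eslog:S_BGM/eslog:C_C106/eslog:D_1004'
elements = root.xpath(path, namespaces=ns)
values = [el.text for el in elements]
print(values)  # ['1889']

# The prefix itself is arbitrary: a different one bound to the same URI
# selects the same elements.
other = root.xpath(path.replace('eslog:', 'x:'), namespaces={'x': 'urn:eslog:2.00'})
other_values = [el.text for el in other]
print(other_values)  # ['1889']
```

Taking the URI from root.nsmap[None] rather than hard-coding it is only a convenience; writing {'eslog': 'urn:eslog:2.00'} directly is equivalent.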