Python lxml does not read XML properly

Time:05-28

I am using Python 2.7 (unfortunately I cannot upgrade to a newer version) and I am trying to parse two XML files with lxml, but the same query returns nothing for the first file and I am not sure what I am doing wrong:

CODE:

from lxml import etree as ET

def string_to_lxml(string):
    xml_file = bytes(bytearray(string, encoding='utf-8'))
    return ET.XML(xml_file)


def find_all(tag, atr):
    return tag.xpath("//%s" % atr)

xml_str_1 = """<?xml version="1.0" encoding="UTF-8"?>
<A xmlns="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0">
    <B name="SOME_NAME_0">
        <C/>
        <D>SOME NAME</D>
        <AA>
            <dir name="include" filters="*.h *.hpp *.tpp *.i"/>
        </AA>
        <H>
            <TAG_1 name="main" default="true"/>
        </H>
    </B>
    <TT>
        <GG>
            <FF configs="main">
                <TAG_2 name="NAME_1"/>
                <TAG_2 name="NAME_2"/>
                <TAG_3 name="NAME_3"/>
                <TAG_3 name="NAME_4"/>
                <TAG_3 name="NAME_5"/>
            </FF>
        </GG>
    </TT>
</A>"""

xml_str_2 = """<?xml version='1.0' encoding='UTF-8'?>
<A xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://obe.nce.amadeus.net/bms/metadata/1-0/">
    <B name="NAME" version="VERSION">
        <AA>SOME NAME</AA>
        <CC>SOME OTHER NAME</CC>
    </B>
    <C>
        <TAG_3 name="NAME_1" path="path_1"/>
        <TAG_3 name="NAME_2" path="path_2"/>
        <TAG_3 name="NAME_3" path="path_3"/>
    </C>
    <D>
        <TAG_3 type="type" name="NAME_1" version="version_1"/>
        <TAG_3 type="type" name="NAME_2" version="version_2"/>
        <TAG_3 type="type" name="NAME_3" version="version_3"/>
    </D>
</A>
"""
root = string_to_lxml(xml_str_1)
print(find_all(root, "TAG_3"))

root = string_to_lxml(xml_str_2)
print(find_all(root, "TAG_3"))

Output:

[]
[<Element TAG_3 at 0x7f257c126640>, <Element TAG_3 at 0x7f257c126be0>, <Element TAG_3 at 0x7f257c126b90>, <Element TAG_3 at 0x7f257c126e10>, <Element TAG_3 at 0x7f257c128730>, <Element TAG_3 at 0x7f257c128640>]

Am I parsing the XML the wrong way?

Answer:

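The first XML document declares a default (unprefixed) namespace, so every element in it, including TAG_3, belongs to that namespace:
xmlns="http://www.w3.org/2001/XMLSchema-instance"
In XPath 1.0 an unprefixed name such as //TAG_3 only matches elements that are in no namespace, which is why the first query returns an empty list. One way around this is to match on the local name and ignore the namespace: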

def find_all(tag, atr):
    return tag.xpath("//*[local-name()= '%s']" % atr)

Result:

[<Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73de88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73df88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73dfc8>]
[<Element TAG_3 at 0x7f39cf73df88>, <Element TAG_3 at 0x7f39cf73dfc8>, <Element TAG_3 at 0x7f39cf73dec8>, <Element TAG_3 at 0x7f39cf762048>, <Element TAG_3 at 0x7f39cf762088>, <Element TAG_3 at 0x7f39cf762108>]
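
An alternative to local-name() is to bind the default namespace to a prefix and use that prefix in the XPath expression, which keeps the match namespace-aware. The following is only a sketch under a few assumptions: the function name find_all_ns and the prefix "d" are made up for illustration, the two byte strings are trimmed-down stand-ins for the documents in the question, and only the namespace declarations visible on the root element are considered (a default namespace redeclared deeper in the tree would not be handled).

```python
from lxml import etree as ET

# Trimmed-down stand-ins for the two documents in the question.
# Byte strings are used so the same code runs on Python 2.7 and 3.
with_default_ns = b"""<A xmlns="http://www.w3.org/2001/XMLSchema-instance">
    <FF>
        <TAG_2 name="NAME_1"/>
        <TAG_3 name="NAME_3"/>
        <TAG_3 name="NAME_4"/>
    </FF>
</A>"""

without_default_ns = b"""<A xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <C>
        <TAG_3 name="NAME_1"/>
    </C>
</A>"""


def find_all_ns(root, tag):
    """Find all descendants named `tag`, honouring the root's default namespace.

    Hypothetical helper for illustration; not part of lxml.
    """
    # nsmap maps prefixes to URIs; the default namespace is stored under None.
    default_ns = root.nsmap.get(None)
    if default_ns is None:
        # No default namespace: a plain, unprefixed name test is enough.
        return root.xpath("//%s" % tag)
    # XPath 1.0 has no notion of a default namespace, so bind it to
    # an arbitrary prefix ("d") and use that prefix in the expression.
    return root.xpath("//d:%s" % tag, namespaces={"d": default_ns})


root_1 = ET.XML(with_default_ns)
root_2 = ET.XML(without_default_ns)

# The qualified tag shows that the root itself lives in the default namespace.
print(root_1.tag)

names_1 = [el.get("name") for el in find_all_ns(root_1, "TAG_3")]
names_2 = [el.get("name") for el in find_all_ns(root_2, "TAG_3")]
print(names_1)
print(names_2)
```

Compared with local-name(), this variant will not match a TAG_3 element that happens to live in some other namespace, which may or may not be what you want.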