I am using Python 2.7 (sadly, I cannot upgrade to a newer version) and I am trying to parse two XML documents with lxml.
The same XPath query finds the TAG_3 elements in the second document but returns an empty list for the first, and I am not sure what I am doing wrong.
Code:
from lxml import etree as ET
def string_to_lxml(string):
    xml_file = bytes(bytearray(string, encoding='utf-8'))
    return ET.XML(xml_file)

def find_all(tag, atr):
    return tag.xpath("//%s" % atr)
xml_str_1 = """<?xml version="1.0" encoding="UTF-8"?>
<A xmlns="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0">
<B name="SOME_NAME_0">
<C/>
<D>SOME NAME</D>
<AA>
<dir name="include" filters="*.h *.hpp *.tpp *.i"/>
</AA>
<H>
<TAG_1 name="main" default="true"/>
</H>
</B>
<TT>
<GG>
<FF configs="main">
<TAG_2 name="NAME_1"/>
<TAG_2 name="NAME_2"/>
<TAG_3 name="NAME_3"/>
<TAG_3 name="NAME_4"/>
<TAG_3 name="NAME_5"/>
</FF>
</GG>
</TT>
</A>"""
xml_str_2 = """<?xml version='1.0' encoding='UTF-8'?>
<A xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://obe.nce.amadeus.net/bms/metadata/1-0/">
<B name="NAME" version="VERSION">
<AA>SOME NAME</AA>
<CC>SOME OTHER NAME</CC>
</B>
<C>
<TAG_3 name="NAME_1" path="path_1"/>
<TAG_3 name="NAME_2" path="path_2"/>
<TAG_3 name="NAME_3" path="path_3"/>
</C>
<D>
<TAG_3 type="type" name="NAME_1" version="version_1"/>
<TAG_3 type="type" name="NAME_2" version="version_2"/>
<TAG_3 type="type" name="NAME_3" version="version_3"/>
</D>
</A>
"""
root = string_to_lxml(xml_str_1)
print(find_all(root, "TAG_3"))
root = string_to_lxml(xml_str_2)
print(find_all(root, "TAG_3"))
Output:
[]
[<Element TAG_3 at 0x7f257c126640>, <Element TAG_3 at 0x7f257c126be0>, <Element TAG_3 at 0x7f257c126b90>, <Element TAG_3 at 0x7f257c126e10>, <Element TAG_3 at 0x7f257c128730>, <Element TAG_3 at 0x7f257c128640>]
Did I parse the XML the wrong way?
Answer:
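The first XML document declares a default namespace (an xmlns attribute without a prefix), which must be taken into account:
xmlns="http://www.w3.org/2001/XMLSchema-instance"
Every unprefixed element in that document, including TAG_3, belongs to this namespace, while an XPath name test such as //TAG_3 only matches elements that are in no namespace, so the query returns an empty list. The second document declares only the xsi prefix, so its elements are in no namespace and the query matches. One way to handle both documents is to match on the local name and ignore the namespace: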
def find_all(tag, atr):
    return tag.xpath("//*[local-name()= '%s']" % atr)
Result:
[<Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73de88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73df88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73dfc8>]
[<Element TAG_3 at 0x7f39cf73df88>, <Element TAG_3 at 0x7f39cf73dfc8>, <Element TAG_3 at 0x7f39cf73dec8>, <Element TAG_3 at 0x7f39cf762048>, <Element TAG_3 at 0x7f39cf762088>, <Element TAG_3 at 0x7f39cf762108>]
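An alternative to local-name() is to bind the default namespace to a prefix and use that prefix in the XPath expression. The following is only a minimal sketch, not the one correct way to do it: the helper name find_all_ns, the prefix "ns", and the two small sample documents (with the placeholder namespace urn:example:ns) are assumptions made for illustration. It also assumes the default namespace is declared on the root element, since root.nsmap only reports declarations visible at that element.

```python
from lxml import etree as ET

def find_all_ns(root, tag):
    """Find all `tag` elements, honouring the root's default namespace.

    Hypothetical helper: assumes the default namespace (if any) is
    declared on the root element itself.
    """
    # lxml stores the default (unprefixed) namespace under the key None.
    default_ns = root.nsmap.get(None)
    if default_ns is None:
        # No default namespace: a plain name test is enough.
        return root.xpath("//%s" % tag)
    # XPath 1.0 has no notion of a default namespace, so bind the URI
    # to an arbitrary prefix ("ns") and use it in the expression.
    return root.xpath("//ns:%s" % tag, namespaces={"ns": default_ns})

# Small illustrative documents (not the ones from the question).
with_ns = ET.XML(
    b'<A xmlns="urn:example:ns"><B>'
    b'<TAG_3 name="NAME_1"/><TAG_3 name="NAME_2"/>'
    b'</B></A>'
)
without_ns = ET.XML(b'<A><B><TAG_3 name="NAME_1"/></B></A>')

print(with_ns.tag)  # the namespace shows up in the tag as {uri}name
print([el.get("name") for el in find_all_ns(with_ns, "TAG_3")])
print([el.get("name") for el in find_all_ns(without_ns, "TAG_3")])
```

Unlike the local-name() version, this only returns elements that are actually in the document's default namespace, which may be preferable when the same local name can appear in several namespaces.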