Unexpected results when parsing XML via lxml

Time:07-29

The output of my XML parsing is not as expected.

The XML file:

<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <einrichtung>
        <name>Name</name>
    </einrichtung>
    <einrichtung>
        <name>Name</name>
    </einrichtung>
</stationaer>

I would expect to get something like root.tag == 'stationaer' and child.tag == 'einrichtung'. See the output at the end.

This is the MWE

#!/usr/bin/env python3
from lxml import etree

xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <einrichtung>
        <name>Name</name>
    </einrichtung>
    <einrichtung>
        <name>Name</name>
    </einrichtung>
</stationaer>
'''

# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)

print(repr(root.tag))
print(repr(root.text))

child = root[0]  # first child element; getchildren() is deprecated in lxml

print(repr(child.tag))
print(repr(child.text))

The output for root is

'{http://foo.bar}stationaer'
'\n    '

and for child

'{http://foo.bar}einrichtung'
'\n        '

I don't understand what's going on here and why that URL is in the output.

CodePudding user response:

This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
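To see the inheritance in practice, here is a minimal sketch that splits each tag into its namespace URI and local name with etree.QName. The document is a shortened copy of the one in the question (one einrichtung, no xsi attributes), and the variable names default_ns and tags are chosen only for illustration.

```python
from lxml import etree

# Shortened copy of the question's document (assumption: one child is enough here).
xml_src = b'''<?xml version="1.0"?>
<stationaer xmlns="http://foo.bar">
    <einrichtung>
        <name>Name</name>
    </einrichtung>
</stationaer>
'''

root = etree.fromstring(xml_src)

# The default namespace (declared without a prefix) is stored under the key None.
default_ns = root.nsmap[None]

# QName splits a namespaced tag into its namespace URI and its local name.
tags = [(etree.QName(el).namespace, etree.QName(el).localname) for el in root.iter()]

for ns, local in tags:
    print(ns, local)
```

Every element, including the nested name, reports the same namespace URI; the local name is the plain tag the question expected.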

The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
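As a rough illustration of searching and creating with this notation, the sketch below uses the question's document without the xsi attributes. The constant NS and the prefix "f" are arbitrary names picked for this example; only the URI has to match the document.

```python
from lxml import etree

NS = "http://foo.bar"  # namespace URI taken from the question's document

xml_src = b'''<?xml version="1.0"?>
<stationaer xmlns="http://foo.bar">
    <einrichtung>
        <name>Name</name>
    </einrichtung>
    <einrichtung>
        <name>Name</name>
    </einrichtung>
</stationaer>
'''

root = etree.fromstring(xml_src)

# Search using the {uri}localname notation directly.
by_clark = root.findall(f"{{{NS}}}einrichtung")

# Search using a prefix map; "f" is an arbitrary prefix chosen for this example.
by_prefix = root.findall("f:einrichtung", namespaces={"f": NS})

# A bare name only matches elements in no namespace, so nothing is found.
unqualified = root.findall("einrichtung")

# Create a new child element in the same namespace.
new_child = etree.SubElement(root, f"{{{NS}}}einrichtung")

print(len(by_clark), len(by_prefix), len(unqualified))
print(new_child.tag)
```

The unqualified search coming back empty is the usual symptom of this issue: the elements exist, but only under their namespace-qualified names.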
