How to get all XPaths from XML with just key names and no template URLs, with Python

I need to extract XPaths and values from an XML object. Currently I use lxml, which gives either long paths with repeated template URLs or just positional indices of the XPath keys without names.

Question: How do I get XPaths with just names, without template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.

With getelementpath(): paths contain the template URLs, and empty keys show '\n\t\t'.

>>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]

[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_639-1'),
 ('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
  'xx'),
 ('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
  '\n\t\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_3166-1')]

With getpath(): paths have no URLs but also no key names (only positional indices), and empty keys still show '\n\t\t'.

>>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]

[('/*/*[2]/*[1]/*', 'ISO_639-1'),
 ('/*/*[2]/*[2]', 'xx'),
 ('/*/*[3]', '\n\t\t'),
 ('/*/*[3]/*[1]', '\n\t\t\t'),
 ('/*/*[3]/*[1]/*', 'ISO_3166-1')]

What I need: key names without URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...

[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]

This is the beginning of the XML:

<?xml version="1.0" ?>
<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>

CodePudding user response：

I'd still use .getpath().

You're getting * in your paths because your XML has a default namespace. Using * means the namespace doesn't need to be taken into account when creating a usable XPath.

To resolve this, first set the element name (.tag) to the local-name (element name without prefix or uri).
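To make the local-name idea concrete, here is a minimal sketch. The one-element document is made up for illustration (only the namespace URI comes from the question); etree.QName and its .localname / .namespace properties are regular lxml API.

```python
from lxml import etree

# Hypothetical two-element document using the default namespace from the question.
root = etree.fromstring(
    b'<root xmlns="http://schemas.oceanehr.com/templates"><language/></root>'
)
child = root[0]

# lxml stores tags in Clark notation: "{namespace-uri}local-name".
qualified_tag = child.tag

# QName splits the qualified tag into its namespace and local-name parts.
qname = etree.QName(child)
local_name = qname.localname
namespace_uri = qname.namespace

print(qualified_tag)  # {http://schemas.oceanehr.com/templates}language
print(local_name)     # language
```

Assigning the local name back to .tag is what removes the namespace from the element, which is what the example further down does.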

Also, you can create an XMLParser and set remove_blank_text to True to get rid of the entries that are only whitespace.

Example...

XML Input (test.xml)

<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>

Python

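from lxml import etree
from pprint import pprint

# remove_blank_text drops whitespace-only text, so empty elements report None
parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse('test.xml', parser=parser)

xpaths = []

for elem in tree.iter():
    # replace the namespaced tag "{uri}name" with just the local name
    elem.tag = etree.QName(elem).localname
    # ancestors were already renamed (document order), so the path uses plain names
    xpaths.append((tree.getpath(elem), elem.text))

pprint(xpaths)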

Printed Output

[('/Lab_test_results', None),
 ('/Lab_test_results/name', None),
 ('/Lab_test_results/name/value', 'Lab test results'),
 ('/Lab_test_results/language', None),
 ('/Lab_test_results/language/terminology_id', None),
 ('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
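If you want the paths exactly as in the question (relative to the root, without the leading root element name), one possible variant is to strip the namespaces first and then call getelementpath() instead of getpath(). This is only a sketch: the function name local_name_paths and the inline XML constant are made up for illustration. It renames in a separate first pass because, when renaming and path building share one loop, later siblings still carry the namespace at the moment a path is computed, which may affect the positional index of repeated siblings.

```python
from lxml import etree

# Sample input, adapted from the question (inline here to keep the sketch self-contained).
XML = b"""<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>
    </language>
</Lab_test_results>"""


def local_name_paths(xml_bytes):
    """Return (path, text) pairs with namespace-free paths relative to the root."""
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml_bytes, parser)
    tree = root.getroottree()

    # Pass 1: strip the namespace from every element tag.
    # iter(tag=etree.Element) skips comments and processing instructions.
    for elem in root.iter(tag=etree.Element):
        elem.tag = etree.QName(elem).localname

    # Pass 2: collect the paths. getelementpath() is relative to the root
    # and reports the root itself as '.', which is skipped here.
    result = []
    for elem in root.iter(tag=etree.Element):
        path = tree.getelementpath(elem)
        if path != ".":
            result.append((path, elem.text))
    return result


paths = local_name_paths(XML)
for item in paths:
    print(item)
```

For the sample above this should print pairs such as ('language/terminology_id/value', 'ISO_639-1') and ('language', None).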

If you need to also collect attributes, you can make a few small changes...

for elem in tree.iter():
    elem.tag = etree.QName(elem).localname
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    for attr in elem.attrib:
        xpaths.append((f"{xpath}/@{attr}", elem.get(attr)))
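One thing to watch for with attributes: namespaced attributes (for example xsi:type) show up in elem.attrib in the same "{uri}name" form as element tags. A rough sketch of reducing those to local names as well is below. The xsi:type="DV_TEXT" attribute is a made-up example that is not in the original XML, and the resulting "@type" is a display label rather than a namespace-aware XPath.

```python
from lxml import etree

# Sample input; the xsi:type attribute is a hypothetical addition for illustration.
XML = b"""<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name xsi:type="DV_TEXT">
        <value>Lab test results</value>
    </name>
</Lab_test_results>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(XML, parser)
tree = root.getroottree()

# Strip the namespace from every element tag first.
for elem in root.iter(tag=etree.Element):
    elem.tag = etree.QName(elem).localname

attr_paths = []
for elem in root.iter(tag=etree.Element):
    xpath = tree.getpath(elem)
    for name, value in elem.attrib.items():
        # QName also accepts a "{uri}name" string; plain names pass through unchanged.
        attr_paths.append((f"{xpath}/@{etree.QName(name).localname}", value))

for item in attr_paths:
    print(item)
```

Namespace declarations (xmlns=..., xmlns:xsi=...) are not part of elem.attrib, so they do not appear in the result.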