How to get all XPaths from XML with just key names and no template URLs, with Python

Time:01-29

I need to extract XPaths and values from an XML object. Currently I use lxml, which gives me either long paths with repeated template URLs or just positional XPath indices without key names.

Question: How do I get XPaths with just the key names, without template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.

  1. With getelementpath(): has template URLs, and '\n\t\t' in empty keys.
>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]

[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_639-1'),
 ('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
  'xx'),
 ('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
  '\n\t\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_3166-1')]
  2. With getpath(): has neither key names nor URLs, and '\n\t\t' in empty keys.
>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]

[('/*/*[2]/*[1]/*', 'ISO_639-1'),
 ('/*/*[2]/*[2]', 'xx'),
 ('/*/*[3]', '\n\t\t'),
 ('/*/*[3]/*[1]', '\n\t\t\t'),
 ('/*/*[3]/*[1]/*', 'ISO_3166-1')]
  3. What I need: key names without URLs, and None in empty keys. I believe I've seen this somewhere, but can't find it now...
[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]

This is the XML header:

<?xml version="1.0" ?>
<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>

Answer:

I'd still use .getpath().

The reason you're getting * in your paths is that your XML has a default namespace. XPath 1.0 has no way to write an unprefixed name that belongs to a namespace, so getpath() falls back to *, which matches the element whatever its namespace and keeps the generated XPath usable.
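
To see the effect in isolation, here is a small sketch comparing getpath() on three tiny documents. The documents, the http://example.com/ns URI and the paths() helper are made up for illustration; only the lxml calls are real.

```python
from lxml import etree

# Hypothetical two-element documents, used only for illustration.
default_ns = etree.fromstring('<a xmlns="http://example.com/ns"><b/></a>')
prefixed = etree.fromstring('<p:a xmlns:p="http://example.com/ns"><p:b/></p:a>')
plain = etree.fromstring('<a><b/></a>')


def paths(root):
    """Return getpath() for every element under root (helper for this sketch)."""
    tree = root.getroottree()
    return [tree.getpath(e) for e in root.iter()]


print(paths(default_ns))  # ['/*', '/*/*']        default namespace: names become *
print(paths(prefixed))    # ['/p:a', '/p:a/p:b']  prefixed namespace: prefix is kept
print(paths(plain))       # ['/a', '/a/b']        no namespace: plain names
```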

To resolve this, first set the element name (.tag) to the local name (the element name without prefix or namespace URI).
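
As a quick illustration of what "local name" means here, etree.QName can split a tag written the way lxml reports it (the "{uri}name" form). The tag string below is copied from the question's output.

```python
from lxml import etree

# Clark notation, as lxml reports it in .tag: "{namespace-uri}local-name".
tag = "{http://schemas.oceanehr.com/templates}terminology_id"

qname = etree.QName(tag)
print(qname.namespace)  # http://schemas.oceanehr.com/templates
print(qname.localname)  # terminology_id
```

Assigning qname.localname back to elem.tag drops the namespace from that element in the parsed tree.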

Also, you can create an XMLParser with remove_blank_text set to True. The parser then drops whitespace-only text between elements, so those elements report None as their text instead of '\n\t\t'.
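
A minimal sketch of the difference, using an inline fragment shaped like the territory element from the question (the byte string is an assumption for the demo, not the asker's real file):

```python
from lxml import etree

# Fragment with the same tab/newline indentation seen in the question's output.
xml = (
    b"<territory>\n\t\t<terminology_id>\n\t\t\t"
    b"<value>ISO_3166-1</value>\n\t\t</terminology_id>\n\t</territory>"
)

# Default parser: indentation is kept as the element's text.
default_root = etree.fromstring(xml)

# Parser that discards whitespace-only text nodes between elements.
stripping_parser = etree.XMLParser(remove_blank_text=True)
stripped_root = etree.fromstring(xml, parser=stripping_parser)

print(repr(default_root.text))   # '\n\t\t'
print(repr(stripped_root.text))  # None
```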

Example...

XML Input (test.xml)

<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>

Python

from lxml import etree
from pprint import pprint

parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse('test.xml', parser=parser)

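# First pass: replace every tag with its local name. Doing this before any
# getpath() call lets getpath() index repeated sibling names consistently.
for elem in tree.iter():
    elem.tag = etree.QName(elem).localname

# Second pass: collect (xpath, text) pairs from the namespace-free tree.
xpaths = [(tree.getpath(elem), elem.text) for elem in tree.iter()]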

pprint(xpaths)

Printed Output

[('/Lab_test_results', None),
 ('/Lab_test_results/name', None),
 ('/Lab_test_results/name/value', 'Lab test results'),
 ('/Lab_test_results/language', None),
 ('/Lab_test_results/language/terminology_id', None),
 ('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
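
The question asked for paths relative to the root ('language/terminology_id/value' rather than '/Lab_test_results/language/...'). One possible way to get that shape, sketched here under the assumption that the document contains no comments or processing instructions, is to strip the namespaces as above and then call getelementpath() instead of getpath(). The XML constant is an inline stand-in for test.xml so the sketch runs on its own.

```python
from io import BytesIO

from lxml import etree

# Inline stand-in for test.xml so the sketch is self-contained.
XML = b"""<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>"""

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(BytesIO(XML), parser=parser)
root = tree.getroot()

# Strip namespaces from every tag first, so all names are plain local names.
for elem in tree.iter():
    elem.tag = etree.QName(elem).localname

# getelementpath() returns paths relative to the root; iterating over the
# descendants only leaves the root element itself out of the result.
relative_paths = [
    (tree.getelementpath(elem), elem.text)
    for elem in root.iterdescendants()
]

for entry in relative_paths:
    print(entry)
# ('name', None)
# ('name/value', 'Lab test results')
# ('language', None)
# ('language/terminology_id', None)
# ('language/terminology_id/value', 'ISO_639-1')
```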

If you also need to collect attributes, you can make a few small changes...

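# Strip namespaces in a first pass so getpath() sees plain names everywhere.
for elem in tree.iter():
    elem.tag = etree.QName(elem).localname

xpaths = []

for elem in tree.iter():
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    # Add one entry per attribute, addressed as <element path>/@<attribute name>.
    for attr, value in elem.attrib.items():
        xpaths.append((f"{xpath}/@{attr}", value))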
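
One caveat with attributes: lxml reports namespaced attribute names (for example xsi:type) in the same "{uri}name" form as element tags, so the URL would reappear in the path. A possible tweak, sketched on a made-up single-element document (the item element and its attribute values are hypothetical), is to run the attribute name through etree.QName as well:

```python
from lxml import etree

# Hypothetical element with one namespaced and one plain attribute.
root = etree.fromstring(
    b'<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xsi:type="DV_TEXT" id="a1"/>'
)
tree = root.getroottree()
xpath = tree.getpath(root)

# Attribute keys come back as "{uri}name" when namespaced; keep the local name.
attr_paths = [
    (f"{xpath}/@{etree.QName(name).localname}", value)
    for name, value in root.attrib.items()
]

print(attr_paths)  # [('/item/@type', 'DV_TEXT'), ('/item/@id', 'a1')]
```

Note that dropping the namespace can make two attributes with the same local name collide, so this only fits documents where that cannot happen.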