I need to extract XPaths and values from an XML object. Currently I use lxml,
which gives either long paths with repeated template URLs or index-only XPaths without element names.
Question: how do I get XPaths with just the names, without the template URLs?
Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml
or a similar library.
- with getelementpath(): has template URLs, and '\n\t\t' in empty keys.
>>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]
[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
'ISO_639-1'),
('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
'xx'),
('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
'\n\t\t\t'),
('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
'ISO_3166-1')]
- with getpath(): has no key names and no URLs, and '\n\t\t' in empty keys.
>>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]
[('/*/*[2]/*[1]/*', 'ISO_639-1'),
('/*/*[2]/*[2]', 'xx'),
('/*/*[3]', '\n\t\t'),
('/*/*[3]/*[1]', '\n\t\t\t'),
('/*/*[3]/*[1]/*', 'ISO_3166-1')]
- what I need: key names without URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...
[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]
This is the beginning of the XML:
<?xml version="1.0" ?>
<Lab test results
xmlns="http://schemas.oceanehr.com/templates"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:rm="http://schemas.openehr.org/v1"
template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
<name>
<value>Lab test results</value>
</name>
<language>
<terminology_id>
<value>ISO_639-1</value>
</terminology_id>
<code_string>ru</code_string>
CodePudding user response:
I'd still use .getpath().
The reason you're getting * in your paths is that your XML has a default namespace. With *, the namespace doesn't need to be taken into account when creating a usable XPath.
To resolve this, first set the element name (.tag) to the local name (the element name without prefix or URI).
Also, you can create an XMLParser with remove_blank_text set to True to get rid of the entries that are only whitespace.
Example...
XML Input (test.xml)
<Lab_test_results
xmlns="http://schemas.oceanehr.com/templates"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:rm="http://schemas.openehr.org/v1"
template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
<name>
<value>Lab test results</value>
</name>
<language>
<terminology_id>
<value>ISO_639-1</value>
</terminology_id>
</language>
</Lab_test_results>
Python
from lxml import etree
from pprint import pprint

# Discard whitespace-only text between tags, so such elements report None as their text.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('test.xml', parser=parser)

xpaths = []
for elem in tree.iter():
    # Replace the namespaced tag with its local name.
    elem.tag = etree.QName(elem).localname
    xpaths.append((tree.getpath(elem), elem.text))

pprint(xpaths)
Printed Output
[('/Lab_test_results', None),
('/Lab_test_results/name', None),
('/Lab_test_results/name/value', 'Lab test results'),
('/Lab_test_results/language', None),
('/Lab_test_results/language/terminology_id', None),
('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
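The question asked for paths relative to the root ('language/terminology_id/value' rather than '/Lab_test_results/language/terminology_id/value'). One possible way to get that shape, sketched here as an assumption rather than as part of the approach above, is to strip the namespaces first and then call getelementpath() instead of getpath(). The names SAMPLE and relative_paths are made up for illustration, and the XML is parsed from a byte string instead of test.xml so the snippet is self-contained. Renaming happens in a separate pass so that any sibling indexes are computed against the final tag names.

```python
from lxml import etree

# Hypothetical sample, shortened from the test.xml input above.
SAMPLE = b"""<Lab_test_results
    xmlns="http://schemas.oceanehr.com/templates"
    template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
  <name>
    <value>Lab test results</value>
  </name>
  <language>
    <terminology_id>
      <value>ISO_639-1</value>
    </terminology_id>
  </language>
</Lab_test_results>"""


def relative_paths(xml_bytes):
    """Return (path, text) pairs with namespace-free paths relative to the root."""
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml_bytes, parser=parser)
    tree = root.getroottree()

    # Pass 1: replace every namespaced tag with its local name.
    # tag=etree.Element skips comments and processing instructions.
    for elem in root.iter(tag=etree.Element):
        elem.tag = etree.QName(elem).localname

    # Pass 2: collect root-relative ElementPath expressions, skipping the root itself.
    return [
        (tree.getelementpath(elem), elem.text)
        for elem in root.iter(tag=etree.Element)
        if elem is not root
    ]


if __name__ == "__main__":
    for path, text in relative_paths(SAMPLE):
        print(path, repr(text))
```

For the sample above this should yield entries such as ('language/terminology_id/value', 'ISO_639-1'), with None for elements that only contain child elements.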
If you need to also collect attributes, you can make a few small changes...
for elem in tree.iter():
    elem.tag = etree.QName(elem).localname
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    # Add one entry per attribute, addressed as <element path>/@<name>.
    for attr in elem.attrib:
        xpaths.append((f"{xpath}/@{attr}", elem.get(attr)))
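One caveat, offered as a hedged sketch: the XML declares an xsi prefix, and a prefixed attribute such as xsi:type would show up in elem.attrib in the form '{uri}type', which brings the URLs back into the paths. Assuming you want local names there too, etree.QName can be applied to the attribute name as well. The function name paths_with_attributes and the DEMO document (including its xsi:type attribute) are invented for illustration, and note that dropping the prefix can make two attributes from different namespaces look identical.

```python
from lxml import etree

# Hypothetical input with one plain and one namespaced attribute.
DEMO = (
    b'<root xmlns="http://schemas.oceanehr.com/templates" '
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" template_id="t1">'
    b'<item xsi:type="DV_TEXT">x</item>'
    b'</root>'
)


def paths_with_attributes(xml_bytes):
    """Return (xpath, value) pairs for elements and attributes, using local names only."""
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml_bytes, parser=parser)
    tree = root.getroottree()

    # Strip element namespaces first so getpath() emits names instead of '*'.
    for elem in root.iter(tag=etree.Element):
        elem.tag = etree.QName(elem).localname

    rows = []
    for elem in root.iter(tag=etree.Element):
        xpath = tree.getpath(elem)
        rows.append((xpath, elem.text))
        for name, value in elem.attrib.items():
            # QName also accepts a '{uri}name' string; localname drops the URI part.
            rows.append((f"{xpath}/@{etree.QName(name).localname}", value))
    return rows


if __name__ == "__main__":
    for row in paths_with_attributes(DEMO):
        print(row)
```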