I am parsing an XML file with Python (3.7) ElementTree, and the aim is to change a date in it. However, as there are three dates present, I need to pinpoint the right one for editing without modifying the others. The relevant part of the XML looks as follows (apologies if the formatting is off):
<CI_Citation>
<date>
<CI_Date>
<date>
<gco:Date>2003-07-01</gco:Date>
</date>
<dateType>
<CI_DateTypeCode CodeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="creation" codeSpace="ISOTC211/19115">creation</CI_DateTypeCode>
</dateType>
</CI_Date>
</date>
<date>
<CI_Date>
<date>
<gco:Date>2003-07-01</gco:Date>
</date>
<dateType>
<CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication" codeSpace="ISOTC211/19115">publication</CI_DateTypeCode>
</dateType>
</CI_Date>
</date>
<date>
<CI_Date>
<date>
<gco:Date>2022-12-02</gco:Date>
</date>
<dateType>
<CI_DateTypeCode CodeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="revision" codeSpace="ISOTC211/19115">revision</CI_DateTypeCode>
</dateType>
</CI_Date>
</date>
</CI_Citation>
Using the namespaces I am able to find the three dates without much trouble, but how do I pick out the one whose type code is revision? As far as I can tell, the paths of the date nodes are all the same. The accompanying dateType should tell me which one to edit, but the two are on the same level.
I'm iterating through the XML file with the following function:
def etree_iter_path(node, rpath, tag=None):
    if tag == "*":
        tag = None
    if tag is None or node.tag == tag:
        yield node, rpath
    for child in node:
        _child_path = '%s/%s' % (rpath, child.tag)
        for subchild, subchild_path in etree_iter_path(child, tag=child.tag, rpath=_child_path):
            yield subchild, subchild_path
I parse the XML file with ElementTree, call getroot(), and use the function to iterate over all nodes. This way I find the dates and date types as separate entities, which makes modifying the right one impossible (or so I currently think). Any thoughts?
I would expect to find the date and date type as a pair, rather than as separate entities, so that the full path in the XML tree would be easy to find.
CodePudding user response:
I was missing the namespace declaration in your XML root tag! After correcting that, I suggest making a list of tuples, e.g. (date, type), to change the date. It is not clear what you want to change: edit the date or reformat it.
Here is my idea for parsing your XML into a DataFrame:
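The list-of-tuples idea could be sketched roughly as follows. This is only an illustration: the `GCO` namespace URI, the shortened `SAMPLE` document, and the helper name `date_type_pairs` are assumptions, so adjust the URI to whatever your root tag binds the gco prefix to. The point is to iterate over each CI_Date element, since it is the common parent of both the date and its type.

```python
import xml.etree.ElementTree as ET

# Assumption: the gco prefix is bound to this URI in the root tag.
GCO = "http://www.isotc211.org/2005/gco"

# Shortened stand-in for the XML in the question (hypothetical sample).
SAMPLE = f"""<CI_Citation xmlns:gco="{GCO}">
  <date><CI_Date>
    <date><gco:Date>2003-07-01</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="creation">creation</CI_DateTypeCode></dateType>
  </CI_Date></date>
  <date><CI_Date>
    <date><gco:Date>2022-12-02</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="revision">revision</CI_DateTypeCode></dateType>
  </CI_Date></date>
</CI_Citation>"""


def date_type_pairs(root):
    """Return one (date, type) tuple per CI_Date element."""
    pairs = []
    for ci_date in root.iter("CI_Date"):
        # Both lookups are relative to the same CI_Date, so they stay paired.
        date_text = ci_date.findtext(f"date/{{{GCO}}}Date")
        type_code = ci_date.find("dateType/CI_DateTypeCode").get("codeListValue")
        pairs.append((date_text, type_code))
    return pairs


root = ET.fromstring(SAMPLE)
print(date_type_pairs(root))
# [('2003-07-01', 'creation'), ('2022-12-02', 'revision')]
```

Because each tuple comes from a single CI_Date, the date and its type can no longer drift apart the way they do when every node is visited independently.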
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('dateRevision.xml')
root = tree.getroot()

# Must match the URI that the root tag binds to the gco prefix
# (in ISO 19139 metadata this is typically http://www.isotc211.org/2005/gco).
ns = "{http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode}"

columns = ["Date", "CI_DateTypeCode_attribute", "CI_DateTypeCode_text"]
gco_list = []

# Each CI_Date holds one date and its type, so collect one row per CI_Date.
for elem in root.findall("./date/CI_Date"):
    row = []
    for gco_date in elem.iter():
        if gco_date.tag == f"{ns}Date":
            gcodate = gco_date.text
        if gco_date.tag == "CI_DateTypeCode":
            type_attrib = gco_date.get('codeListValue')
            type_text = gco_date.text
            row = [gcodate, type_attrib, type_text]
            gco_list.append(row)
            row = []

df = pd.DataFrame(gco_list, columns=columns)
print(df)
Output:
         Date CI_DateTypeCode_attribute CI_DateTypeCode_text
0  2003-07-01                  creation             creation
1  2003-07-01               publication          publication
2  2022-12-02                  revision             revision
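To actually change only the revision date, one possible approach is to test the codeListValue inside each CI_Date and then set the text of the gco:Date under that same element. The sketch below is not a definitive implementation: the `GCO` URI, the shortened `SAMPLE`, the helper name `set_date_for_type`, and the new date 2023-01-15 are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the gco prefix is bound to this URI in the root tag.
GCO = "http://www.isotc211.org/2005/gco"

# Shortened stand-in for the XML in the question (hypothetical sample).
SAMPLE = f"""<CI_Citation xmlns:gco="{GCO}">
  <date><CI_Date>
    <date><gco:Date>2003-07-01</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="creation">creation</CI_DateTypeCode></dateType>
  </CI_Date></date>
  <date><CI_Date>
    <date><gco:Date>2003-07-01</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="publication">publication</CI_DateTypeCode></dateType>
  </CI_Date></date>
  <date><CI_Date>
    <date><gco:Date>2022-12-02</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="revision">revision</CI_DateTypeCode></dateType>
  </CI_Date></date>
</CI_Citation>"""


def set_date_for_type(root, date_type, new_date):
    """Set gco:Date for every CI_Date of the given type; return how many changed."""
    changed = 0
    for ci_date in root.iter("CI_Date"):
        code = ci_date.find("dateType/CI_DateTypeCode")
        date_el = ci_date.find(f"date/{{{GCO}}}Date")
        if code is None or date_el is None:
            continue  # skip incomplete entries instead of failing
        if code.get("codeListValue") == date_type:
            date_el.text = new_date
            changed += 1
    return changed


# Keep the "gco:" prefix when serializing instead of an auto-generated "ns0:".
ET.register_namespace("gco", GCO)

root = ET.fromstring(SAMPLE)
set_date_for_type(root, "revision", "2023-01-15")  # hypothetical new date
print(ET.tostring(root, encoding="unicode"))
```

When working from a file, the same function can be applied to `tree.getroot()` after `ET.parse(...)`, followed by `tree.write(...)` to save the result; the creation and publication dates are left untouched.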