How to find a date in an XML file that has the right date type code

Time:12-13

I am parsing an XML file with Python (3.7) ElementTree, and the aim is to change a date in it. However, as three dates are present, I need to pinpoint the right one for editing without modifying the others. The relevant part of the XML looks as follows (apologies if the formatting is off):

<CI_Citation>
  <date>
    <CI_Date>
      <date>
        <gco:Date>2003-07-01</gco:Date>
      </date>
      <dateType>
        <CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="creation" codeSpace="ISOTC211/19115">creation</CI_DateTypeCode>
      </dateType>
    </CI_Date>
  </date>
  <date>
    <CI_Date>
      <date>
        <gco:Date>2003-07-01</gco:Date>
      </date>
      <dateType>
        <CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication" codeSpace="ISOTC211/19115">publication</CI_DateTypeCode>
      </dateType>
    </CI_Date>
  </date>
  <date>
    <CI_Date>
      <date>
        <gco:Date>2022-12-02</gco:Date>
      </date>
      <dateType>
        <CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="revision" codeSpace="ISOTC211/19115">revision</CI_DateTypeCode>
      </dateType>
    </CI_Date>
  </date>
</CI_Citation>

Using the namespaces I can find the three dates without much trouble, but how do I pick out the one with the revision type code? As far as I can tell, the paths of the date nodes are all the same. The accompanying dateType should tell me which one to edit, but it sits on the same level as the date rather than above it.

I'm iterating through the XML file with the following function:

def etree_iter_path(node, rpath, tag=None):
    if tag == "*":
        tag = None
    if tag is None or node.tag == tag:
        yield node, rpath
    for child in node:
        _child_path = '%s/%s' % (rpath, child.tag)
        for subchild, subchild_path in etree_iter_path(child, tag=child.tag, rpath=_child_path):
            yield subchild, subchild_path

I parse the XML file with ElementTree, call getroot(), and use the function to iterate over all nodes. This way I find the dates and date types as separate entities, which makes modifying the right one impossible (or so I currently think). Any thoughts?

I would expect to find the date and date type as a pair rather than as separate entities, so that the full path of the right date in the XML tree would be easy to find.

CodePudding user response:

I missed the namespace in the root tag of your XML. With that corrected, I suggest building a list of (date, type) tuples so you can pick out the date to change. It is not clear what you want to change, though: the date value itself or its format.
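
As a rough sketch of the tuple idea, something along these lines might work. It rests on a few assumptions: the gco prefix is bound to http://www.isotc211.org/2005/gco (the usual URI in ISO 19139 metadata, but use whatever your root tag declares), the other elements carry no namespace (as in the snippet you posted), and the inline SAMPLE is a shortened stand-in for your file. The helper name date_type_pairs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: URI declared for the gco: prefix in the root tag of the file.
NS = {"gco": "http://www.isotc211.org/2005/gco"}

# Shortened stand-in for the XML in the question (two of the three dates).
SAMPLE = """<CI_Citation xmlns:gco="http://www.isotc211.org/2005/gco">
  <date><CI_Date>
    <date><gco:Date>2003-07-01</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="creation">creation</CI_DateTypeCode></dateType>
  </CI_Date></date>
  <date><CI_Date>
    <date><gco:Date>2022-12-02</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="revision">revision</CI_DateTypeCode></dateType>
  </CI_Date></date>
</CI_Citation>"""


def date_type_pairs(root):
    """Return a list of (gco:Date element, codeListValue) tuples, one per CI_Date."""
    pairs = []
    # CI_Date is the common parent of a date and its type code, so looking
    # both up relative to it keeps them together.
    for ci_date in root.iter("CI_Date"):
        date_elem = ci_date.find("date/gco:Date", NS)
        type_elem = ci_date.find("dateType/CI_DateTypeCode")
        if date_elem is not None and type_elem is not None:
            pairs.append((date_elem, type_elem.get("codeListValue")))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = date_type_pairs(root)
print([(d.text, t) for d, t in pairs])
```

Because each tuple holds the element itself rather than just its text, assigning to the text of the element paired with "revision" would change the tree directly.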

Here is my idea for parsing your XML into a DataFrame:

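import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('dateRevision.xml')
root = tree.getroot()

# Namespace URI bound to the gco: prefix, in Clark notation ({uri}tag).
# It must match the xmlns:gco declaration in the root tag of your file
# (ISO 19139 files normally declare http://www.isotc211.org/2005/gco).
ns = "{http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode}"

columns = ["Date", "CI_DateTypeCode_attribute", "CI_DateTypeCode_text"]
gco_list = []
# Each CI_Date holds one date together with its type code, so walking one
# CI_Date at a time keeps the two values paired.
for elem in root.findall("./date/CI_Date"):
    gcodate = None  # reset so a date never leaks over from the previous CI_Date
    for child in elem.iter():
        if child.tag == f"{ns}Date":
            gcodate = child.text
        elif child.tag == "CI_DateTypeCode":
            type_attrib = child.get('codeListValue')
            type_text = child.text
            gco_list.append([gcodate, type_attrib, type_text])

df = pd.DataFrame(gco_list, columns=columns)
print(df)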

Output:

         Date CI_DateTypeCode_attribute CI_DateTypeCode_text
0  2003-07-01                  creation             creation
1  2003-07-01               publication          publication
2  2022-12-02                  revision             revision
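
The table only reads the values. If the goal is to overwrite the revision date in the tree itself, a minimal sketch could look like the following. The assumptions are the same as before: the gco URI is the usual ISO 19139 one and may differ in your file, the remaining elements have no namespace, and the inline SAMPLE stands in for the real document. The function name set_date_for_type and the new date 2023-01-15 are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: URI declared for the gco: prefix in the root tag of the file.
GCO = "http://www.isotc211.org/2005/gco"
# Keep the gco: prefix on output instead of an auto-generated ns0: prefix.
ET.register_namespace("gco", GCO)

# Shortened stand-in for the XML in the question.
SAMPLE = f"""<CI_Citation xmlns:gco="{GCO}">
  <date><CI_Date>
    <date><gco:Date>2003-07-01</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="publication">publication</CI_DateTypeCode></dateType>
  </CI_Date></date>
  <date><CI_Date>
    <date><gco:Date>2022-12-02</gco:Date></date>
    <dateType><CI_DateTypeCode codeListValue="revision">revision</CI_DateTypeCode></dateType>
  </CI_Date></date>
</CI_Citation>"""


def set_date_for_type(root, date_type, new_date):
    """Overwrite gco:Date in every CI_Date whose codeListValue equals date_type.

    Returns the number of dates that were changed.
    """
    changed = 0
    for ci_date in root.iter("CI_Date"):
        code = ci_date.find("dateType/CI_DateTypeCode")
        date_elem = ci_date.find(f"date/{{{GCO}}}Date")
        if code is None or date_elem is None:
            continue
        if code.get("codeListValue") == date_type:
            date_elem.text = new_date
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
changed = set_date_for_type(root, "revision", "2023-01-15")
print(changed)
print(ET.tostring(root, encoding="unicode"))
```

With a file on disk the same function can be applied to ET.parse(...).getroot(), followed by tree.write(...) to save the result.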