Home > Software engineering >  Add parent element to child element to have one tuple (when converting XML to dictionary collection)
Add parent element to child element to have one tuple (when converting XML to dictionary collection)

Time:12-09

my ad-hoc question relates to the following

given this xml structure (saved in xml_FILE):

<countriesAndStates>
  <countries>
    <name>USA</name>
    <states>
      <active>true</active>
      <stateName>Colorado</stateName>
      <isoCode>CO</isoCode>
    </states>
    <states>
      <active>false</active>
      <stateName>Florida</stateName>
      <isoCode>FL</isoCode>
    </states>
  </countries>
</countriesAndStates>

I traverse the state element level via this for-loop and save the result to a dictionary of lists through the help of collections from defaultdict as follows:

from collections import defaultdict
tree = ET.parse(xml_FILE)
root = tree.getroot()
    
dict_of_list = defaultdict(list)
for key in root.findall("./countriesAndStates/"
                              "countries/"
                              "states/"):
    dict_of_list[key.tag].append(key.text)

I then transform this dict to a dataframe and I'll have all the tuples containing the state element's data, cf.:

df = pd.DataFrame(dict_of_list)
print(df)

This way I get the following dataframe output (scheme tuple):

     active    stateName isoCode               
0    true      Colorado  CO       

However, I want to have the country for each state tuple such that each tuple/row in dataframe will transform to:

    name  active    stateName isoCode               
0   USA   true      Colorado  CO  

In other words: for each state/record I want the country name too. How can I achieve this?

Thank you in advance.

CodePudding user response:

something like this

import xml.etree.ElementTree as ET
import pandas as pd


xml = '''<countriesAndStates>
  <countries>
    <name>USA</name>
    <states>
      <active>true</active>
      <stateName>Colorado</stateName>
      <isoCode>CO</isoCode>
    </states>
    <states>
      <active>false</active>
      <stateName>Florida</stateName>
      <isoCode>FL</isoCode>
    </states>
  </countries>
</countriesAndStates>'''

data = []
root = ET.fromstring(xml)
for country in root.findall('.//countries'):
  name = country.find('name').text
  for state in country.findall('states'):
    data.append({'name':name})
    for e in list(state):
      data[-1][e.tag] = e.text
df = pd.DataFrame(data)
print(df)

output

  name active stateName isoCode
0  USA   true  Colorado      CO
1  USA  false   Florida      FL
  • Related