A noob question on XML. I am using ElementTree to convert the XML to a df and eventually upload to a database. My problem is that the XML format is not standard. All ElementTree examples that I have seen use a different format. It looks something like this:
<session session_id="9">
<party party_id="1" name="party 1">
<member member_id="1" name="bob"/>
<member member_id="2" name="simon"/>
</party>
<party party_id="2" name="party 2">
<member member_id="3" name="diana"/>
<member member_id="4" name="pablo"/>
</party>
</session>
For one thing, all information is written in attributes, which is not a big issue because I can still fetch them. However I want to pick up the attributes not only of the member
elements, but also of their parents. So the result should look something like this:
member_id | member_name | party_id | session_id |
---|---|---|---|
1 | bob | 1 | 9 |
2 | simon | 1 | 9 |
3 | diana | 2 | 9 |
4 | pablo | 2 | 9 |
I use children = list(root.iter())
to list all children and then append their attributes to a dataframe. However I lose the link to parent, so I cannot really say which party
branch the member
came from, so I cannot assign the right party_id
.
I am wondering whether there is an easy way to get a dataframe out of this XML structure?
CodePudding user response:
something like this
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<session session_id="9">
<party party_id="1" name="party 1">
<member member_id="1" name="bob"/>
<member member_id="2" name="simon"/>
</party>
<party party_id="2" name="party 2">
<member member_id="3" name="diana"/>
<member member_id="4" name="pablo"/>
</party>
</session>'''
data = []
root = ET.fromstring(xml)
session_id = root.attrib['session_id']
for party in root.findall('party'):
for memeber in party.findall('member'):
data.append({'session_id':session_id,'party_id': party.attrib['party_id']})
data[-1]['member_name'] = memeber.attrib['name']
data[-1]['member_id'] = memeber.attrib['member_id']
df = pd.DataFrame(data)
print(df)
output
session_id party_id member_name member_id
0 9 1 bob 1
1 9 1 simon 2
2 9 2 diana 3
3 9 2 pablo 4
CodePudding user response:
Your xml is fully valid and well formed, but the example below indicates how xml is usually structured.
<parent attribute="supporting information">
Information
<child attribute="supporting information">
Information
</child>
</parent>
Alas, ElementTree does not provide a neat way of converting attributes to tag content/information.
Solution
You are half-way there (or possibly a third if we count the root element 'session'). You will have to iterate per xml-element level. (e.g. for every party element, iterate every member).