I have a xml file like:
<plays format="tokens">
<period number="1">
<play/>
<play/>
<play/>
</period>
<period number="2">
<play/>
<play/>
<play/>
</period>
Each play tag contains a bunch of variables, but I would also like to add the period number as a variable to the play tags. My goal is to produce a table with each play and their attributes as well as a column that says which period that played occurred in (1 or 2).
My current code to flatten the plays out is:
d = []
for play in root.iter('play'):
d.append(play.attrib)
df = pd.DataFrame(d)
This gives me every play and their attributes in the table df, but the period is not currently included in this table. Any direction would help, thank you!
CodePudding user response:
You can do it this way with ElementTree
like below-
plays.xml
<plays format="tokens">
<period number="1">
<play attr1="abc" attr2="ddd"/>
<play attr1="cbc" attr2="ddd"/>
<play attr1="dbc" attr2="ddd"/>
</period>
<period number="2">
<play attr1="abc" attr2="ddd"/>
<play attr1="dbc" attr2="ddd"/>
<play attr1="kbc" attr2="ddd" />
</period>
</plays>
main.py
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('plays.xml')
root = tree.getroot()
# find the period number for each play by searching for the parent period element
periods = []
for period in root.iter('period'):
number = period.attrib['number']
for play in period.iter('play'):
other_attr = play.attrib
# this line merges the other_attributes of play element(attr1, attr2) with the top attribute(number) of period element, see reference: https://stackoverflow.com/a/62820532/1138192
periods.append({**{"number": number}, **other_attr})
df = pd.DataFrame(periods)
print(df)
Output:
number attr1 attr2
0 1 abc ddd
1 1 cbc ddd
2 1 dbc ddd
3 2 abc ddd
4 2 dbc ddd
5 2 kbc ddd