I'm attempting to read xml files using pandas read_xml
, but struggling with including the parent attributes in the output. Is this possible using read_xml
or do I need to use a different parser? Example xml, what I have attempted and desired output below.
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
# Getting all VT values, but no parent attributes
pd.read_xml(xml, xpath='.//VT')
"""
t VT
0 2020-08-18T20:30:00 3.2
1 2020-08-18T21:30:00 3.3
2 2020-08-18T22:30:00 3.2
3 2020-08-18T20:30:00 21.7
4 2020-08-18T21:30:00 21.8
5 2020-08-18T22:30:00 22.0
"""
# Alternative to read one channel at the time
# But want to avoid opening file several times since they can be large
# For example like this, and then build the dataframe in a loop
pd.read_xml(xml, xpath='.//ChannelData[@channelId="1"]/Values/VT')
"""
t VT
0 2020-08-18T20:30:00 3.2
1 2020-08-18T21:30:00 3.3
2 2020-08-18T22:30:00 3.2
"""
Example of desired output (can contain all or selection of parent attributes), ideally from only reading xml file once
t VT channelId LocationId
0 2020-08-18T20:30:00 3.2 1 123
1 2020-08-18T21:30:00 3.3 1 123
2 2020-08-18T22:30:00 3.2 1 123
3 2020-08-18T20:30:00 21.7 2 123
4 2020-08-18T21:30:00 21.8 2 123
5 2020-08-18T22:30:00 22.0 2 123
CodePudding user response:
Using ElementTree - see below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
data = []
root = ET.fromstring(xml)
for ld in root.findall('.//LocationData'):
location = ld.attrib['LocationId']
for cd in ld.findall('ChannelData'):
channel = cd.attrib['channelId']
for vt in cd.findall('.//VT'):
data.append({'t': vt.attrib['t'],'VT': vt.text,'channelId':channel,'LocationId':location})
df = pd.DataFrame(data)
print(df)
output
t VT channelId LocationId
0 2020-08-18T20:30:00 3.2 1 123
1 2020-08-18T21:30:00 3.3 1 123
2 2020-08-18T22:30:00 3.2 1 123
3 2020-08-18T20:30:00 21.7 2 123
4 2020-08-18T21:30:00 21.8 2 123
5 2020-08-18T22:30:00 22.0 2 123
CodePudding user response:
One option is to make the path generic, to capture as much as you want, starting from LocationData node:
(pd
.read_xml(xml, xpath='.//*')
.dropna(how = 'all', axis = 1)
.assign(channelId = lambda df: df.channelId.ffill(),
LocationId = lambda df: df.LocationId.ffill(),
name = lambda df: df.name.ffill())
.dropna(subset='t')
)
LocationId name channelId VT t
3 123.0 Temperature 1.0 3.2 2020-08-18T20:30:00
4 123.0 Temperature 1.0 3.3 2020-08-18T21:30:00
5 123.0 Temperature 1.0 3.2 2020-08-18T22:30:00
8 123.0 Speed 2.0 21.7 2020-08-18T20:30:00
9 123.0 Speed 2.0 21.8 2020-08-18T21:30:00
10 123.0 Speed 2.0 22.0 2020-08-18T22:30:00