Home > Enterprise >  Including parent attributes when parsing xml using pandas read_xml
Including parent attributes when parsing xml using pandas read_xml

Time:04-20

I'm attempting to read xml files using pandas read_xml, but struggling with including the parent attributes in the output. Is this possible using read_xml or do I need to use a different parser? Example xml, what I have attempted and desired output below.

import pandas as pd

xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
 <LocationData LocationId="123" name="LocationName">                       
  <ChannelData channelId="1" name="Temperature"  >
  <Values>
   <VT t="2020-08-18T20:30:00">3.2</VT>
   <VT t="2020-08-18T21:30:00">3.3</VT>
   <VT t="2020-08-18T22:30:00">3.2</VT>
  </Values>
  </ChannelData>
  <ChannelData channelId="2" name="Speed" >
  <Values>
   <VT t="2020-08-18T20:30:00">21.7</VT>
   <VT t="2020-08-18T21:30:00">21.8</VT>
   <VT t="2020-08-18T22:30:00">22.0</VT>
  </Values>
  </ChannelData>
 </LocationData>
</LocationList>
'''

# Getting all VT values, but no parent attributes
pd.read_xml(xml, xpath='.//VT')
"""
                     t    VT
0  2020-08-18T20:30:00   3.2
1  2020-08-18T21:30:00   3.3
2  2020-08-18T22:30:00   3.2
3  2020-08-18T20:30:00  21.7
4  2020-08-18T21:30:00  21.8
5  2020-08-18T22:30:00  22.0
"""

# Alternative to read one channel at the time
# But want to avoid opening file several times since they can be large
# For example like this, and then build the dataframe in a loop
pd.read_xml(xml, xpath='.//ChannelData[@channelId="1"]/Values/VT')
"""
                     t   VT
0  2020-08-18T20:30:00  3.2
1  2020-08-18T21:30:00  3.3
2  2020-08-18T22:30:00  3.2
"""

Example of desired output (can contain all or selection of parent attributes), ideally from only reading xml file once

                     t    VT  channelId   LocationId
0  2020-08-18T20:30:00   3.2          1          123
1  2020-08-18T21:30:00   3.3          1          123
2  2020-08-18T22:30:00   3.2          1          123
3  2020-08-18T20:30:00  21.7          2          123
4  2020-08-18T21:30:00  21.8          2          123
5  2020-08-18T22:30:00  22.0          2          123

CodePudding user response:

Using ElementTree - see below

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
 <LocationData LocationId="123" name="LocationName">                       
  <ChannelData channelId="1" name="Temperature"  >
  <Values>
   <VT t="2020-08-18T20:30:00">3.2</VT>
   <VT t="2020-08-18T21:30:00">3.3</VT>
   <VT t="2020-08-18T22:30:00">3.2</VT>
  </Values>
  </ChannelData>
  <ChannelData channelId="2" name="Speed" >
  <Values>
   <VT t="2020-08-18T20:30:00">21.7</VT>
   <VT t="2020-08-18T21:30:00">21.8</VT>
   <VT t="2020-08-18T22:30:00">22.0</VT>
  </Values>
  </ChannelData>
 </LocationData>
</LocationList>
'''

data = []
root = ET.fromstring(xml)
for ld in root.findall('.//LocationData'):
  location = ld.attrib['LocationId']
  for cd in ld.findall('ChannelData'):
    channel = cd.attrib['channelId']
    for vt in cd.findall('.//VT'):
      data.append({'t': vt.attrib['t'],'VT': vt.text,'channelId':channel,'LocationId':location})
      
df = pd.DataFrame(data)
print(df)

output

                     t    VT channelId LocationId
0  2020-08-18T20:30:00   3.2         1        123
1  2020-08-18T21:30:00   3.3         1        123
2  2020-08-18T22:30:00   3.2         1        123
3  2020-08-18T20:30:00  21.7         2        123
4  2020-08-18T21:30:00  21.8         2        123
5  2020-08-18T22:30:00  22.0         2        123

CodePudding user response:

One option is to make the path generic, to capture as much as you want, starting from LocationData node:

(pd
.read_xml(xml, xpath='.//*')
.dropna(how = 'all', axis = 1)
.assign(channelId = lambda df: df.channelId.ffill(), 
        LocationId = lambda df: df.LocationId.ffill(), 
        name = lambda df: df.name.ffill())
.dropna(subset='t')
)
    LocationId         name  channelId    VT                    t
3        123.0  Temperature        1.0   3.2  2020-08-18T20:30:00
4        123.0  Temperature        1.0   3.3  2020-08-18T21:30:00
5        123.0  Temperature        1.0   3.2  2020-08-18T22:30:00
8        123.0        Speed        2.0  21.7  2020-08-18T20:30:00
9        123.0        Speed        2.0  21.8  2020-08-18T21:30:00
10       123.0        Speed        2.0  22.0  2020-08-18T22:30:00
  • Related