Home > Software design >  How to read xml tag with multiple data using pandas pd.read_xml in Python?
How to read xml tag with multiple data using pandas pd.read_xml in Python?

Time:03-14

I have following toy example code to read an xml using pandas

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
    <d>10</d>
    <d>20</d>
    <d>11</d>
    <d>2</d>
    <d>5</d>
</data>'''

import pandas as pd
df = pd.read_xml(xml,xpath="//data")

Result

>>> df
   d
0  5

It seems it only reads the last data item.

Expected result

However I expect the following DataFrame as result:

>>> df
    d
0  10
1  20
2  11
3   2
4   5

Question

Is it possible to read all the d tags in a single DataFrame using pd.read_xml?

CodePudding user response:

That's not a use case supported by pd.read_xml. It works best if the XPath results in a list of nodes of the form:

<row>
  <col1>...</col1>
  <col2>...</col2>
  <col3>...</col3>
</row>

You can parse the XML manually:

from xml.etree import ElementTree

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
    <d>10</d>
    <d>20</d>
    <d>11</d>
    <d>2</d>
    <d>5</d>
</data>'''

tree = ElementTree.fromstring(xml)
df = pd.DataFrame({
    'd': [int(d.text) for d in tree.findall(".//d")]
})
  • Related