Unable to parse XML data using the pandas method read_xml()

Time:10-13

import pandas as pd

xml = '''<?xml version='1.0' encoding='utf-8'?>
    <doc:data xmlns:doc="https://example.com">
      <doc:row>
        <doc:shape value="triangle" />
        <doc:degrees value="180" />
        <doc:sides value="3.0"/>
      </doc:row>
      <doc:row>
        <doc:shape value="triangle" />
        <doc:degrees value="180" />
        <doc:sides value="3.0"/>
      </doc:row>
      <doc:row>
        <doc:shape value="triangle" />
        <doc:degrees value="180" />
        <doc:sides value="3.0"/>
      </doc:row>
    </doc:data>'''

df = pd.read_xml(xml,
                 xpath="//doc:row",
                 namespaces={"doc": "https://example.com"})
print(df)

I am getting the output as follows:

   shape  degrees  sides
0    NaN      NaN    NaN
1    NaN      NaN    NaN
2    NaN      NaN    NaN

The expected output is:

      shape  degrees  sides
0  triangle      180    3.0
1  triangle      180    3.0
2  triangle      180    3.0

The values for each tag are stored in its "value" attribute. When the values are in the element text instead of an attribute, the data loads properly. Please help me get the respective value for each element in the XML above.

CodePudding user response:

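If you know the columns beforehand, you can use the iterparse keyword argument (available since pandas 1.5) instead of xpath. Note that iterparse only reads from a file on disk, not from a string, so the XML is assumed to be saved as example.xml: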

df = pd.read_xml("example.xml",
                 iterparse={"row": ["value", "value", "value"]},
                 names=["shape", "degrees", "sides"])

Output:

      shape  degrees  sides
0  triangle      180    3.0
1  triangle      180    3.0
2  triangle      180    3.0

Edit: the solution above isn't robust, because changing the order of the subelements will scramble the data (the problem being that every subelement uses the same attribute name, value). If the order might change, you can still build the columns one at a time and concatenate them:

df = pd.concat(
    [
        pd.read_xml("example.xml", iterparse={name: ["value"]}, names=[name])
        for name in ["shape", "degrees", "sides"]
    ],
    axis=1,
)

I have no idea how it would perform on a bigger file, though, since the file is parsed once per column.
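
If the repeated parsing is a concern, one possible alternative is to skip read_xml and collect the attributes in a single pass with the standard library's xml.etree.ElementTree. This is only a sketch, not something from the pandas documentation: the function name rows_from_value_attributes is made up for illustration, and the sample XML deliberately shuffles the subelements in the second row to show that the order does not matter.

```python
import xml.etree.ElementTree as ET

# Sample data in the same shape as the question; the second row is
# reordered on purpose (illustrative data, not from the original post).
XML = """<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:shape value="triangle" />
    <doc:degrees value="180" />
    <doc:sides value="3.0"/>
  </doc:row>
  <doc:row>
    <doc:sides value="4.0"/>
    <doc:shape value="square" />
    <doc:degrees value="360" />
  </doc:row>
</doc:data>"""

NS = {"doc": "https://example.com"}


def rows_from_value_attributes(xml_text):
    """Return one dict per <doc:row>, keyed by the child tag's local name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for row in root.findall("doc:row", NS):
        record = {}
        for child in row:
            # Tags look like "{https://example.com}shape"; keep "shape".
            name = child.tag.split("}", 1)[-1]
            record[name] = child.get("value")
        rows.append(record)
    return rows


rows = rows_from_value_attributes(XML)
print(rows)
```

Passing the result to pd.DataFrame(rows) should give one column per subelement. All values come back as strings, so numeric columns would still need a conversion such as pd.to_numeric.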
