How do I export part of an XML file to a multi-level DataFrame in pandas?-CodePudding

I am trying to export part of an XML file to a multi-level DataFrame because I find it more convenient to work with. An exemple of the file would be :

<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

The ideal output from this example would be a Pandas Dataframe similar to this :

time-begin  time-end     thing  location    surface
         1         2         1         2      1,256
                                      45        2,3
                             3         2        1,6
                                       4        2,5
                                      17        2,4
         2         3         1        78        3,2
                             5         2        1,7
                                       7        4,5

Here is the current code I wrote :

import pandas as pd
from bs4 import BeautifulSoup
import lxml

datas = open("stack_example.xml","r")
doc = BeautifulSoup(datas.read(), "lxml")
doc.unit.get("surface")

l = []
temp={}

surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")

    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]

    things = surfacedata.find_all("thing", recursive=False)
    for thing in thingss:
        identity = thing.get("identity")
        temp["thing"]=[identity]
 
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["surface"]=[surface]
            temp["location"]=[l_identity]
        l.append(pd.DataFrame(temp))
        
res = pd.concat(l, ignore_index=True).fillna(0.)

It only gets the last location of all things because the location gets refreshed in the loop, but I'm not sure on how to achieve desired result from this point.

CodePudding user response：

You could set the index based on columns:

df.set_index(df.columns.to_list())

Take a look at some issues in your code:

There is a typo in for thing in thingss:
Instead of temp["location"]=[identity] set temp["thing"]=[identity]

Change position to get correct order of columns

temp["surface"]=[surface]
temp["location"]=[l_identity]

Example

import pandas as pd
from bs4 import BeautifulSoup

xml = '''
<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>
'''

doc = BeautifulSoup(xml, "lxml")


l = []
temp = {}

surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")

    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]

    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        identity = thing.get("identity")
        temp["thing"]=[identity]
 
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["location"]=[l_identity]
            temp["surface"]=[surface]
            l.append(pd.DataFrame(temp))

df = pd.concat(l, ignore_index=True).fillna(0.)
df.set_index(df.columns.to_list())

CodePudding user response：

You nearly have it. Just slightly altered the logic a bit. But looks right.

datas = '''<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>'''

Code:

import pandas as pd
from bs4 import BeautifulSoup
import lxml

#datas = open("stack_example.xml","r")
#doc = BeautifulSoup(datas.read(), "lxml")
doc = BeautifulSoup(datas, "lxml")
doc.unit.get("surface")

rows = []
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    row = {}
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")

    row = {'time-begin':time_begin,
         'time-end':time_end}


    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        identity = thing.get("identity")
        row.update({'thing':identity})
        
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            locationStr = location['l-identity']   
            surface = location.text.strip()
            
            row.update({'location':locationStr,
                        'surface':surface})
            
            row_copy = row.copy()
            rows.append(row_copy)
            
df = pd.DataFrame(rows)

Output:

print(df)
  time-begin time-end thing location surface
0          1        2     1        2   1.256
1          1        2     1       45     2.3
2          1        2     3        2     1.6
3          1        2     3        4     2.5
4          1        2     3       17     2.4
5          2        3     1       78     3.2
6          2        3     5        2     1.7
7          2        3     5        7     4.5

CodePudding user response：

Since you are using lxml, consider XSLT, the special-purpose language designed to transform XML files, and the recent IO module, pandas.read_xml introduced in v1.3. While this method works on flat XML files by default, its stylesheet argument allows you to transform the raw input to flatter format for data frame migration.

Specifically, XSLT parses down to the <location> nodes and pull parent and ancestor attributes as siblings for flatter structure with repeated nodes.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output omit-xml-declaration="yes" indent="yes"/>
   <xsl:strip-space elements="*"/>
  
   <xsl:template match="/file">
     <data>
       <xsl:apply-templates select="descendant::location"/>
     </data>
   </xsl:template>

   <xsl:template match="location">
     <row>
       <time_begin><xsl:value-of select="ancestor::surfacedata/@time-begin"/></time_begin>
       <time_end><xsl:value-of select="ancestor::surfacedata/@time-end"/></time_end>
       <thing><xsl:value-of select="parent::thing/@identity"/></thing>
       <location><xsl:value-of select="@l-identity"/></location>
       <surface><xsl:value-of select="normalize-space()"/></surface>
     </row>
   </xsl:template>
  
</xsl:stylesheet>

Online Demo

Python (no other packages needed)

import pandas as pd

xml_file = "input.xml"
xsl_file = "style.xsl"

surface_data_df = pd.read_xml(xml_file, stylesheet=xsl_file)

surface_data_df
#    time_begin  time_end  thing  location  surface
# 0           1         2      1         2    1.256
# 1           1         2      1        45    2.300
# 2           1         2      3         2    1.600
# 3           1         2      3         4    2.500
# 4           1         2      3        17    2.400
# 5           2         3      1        78    3.200
# 6           2         3      5         2    1.700
# 7           2         3      5         7    4.500