I am trying to export part of an XML file to a multi-level DataFrame because I find it more convenient to work with. An exemple of the file would be :
<file filename="stack_example" created="today">
<unit time="day" volume="cm3" surface="cm2"/>
<zone z_id="10">
<surfacehistory type="calculation">
<surfacedata time-begin="1" time-end="2">
<thing identity="1">
<location l-identity="2"> 1.256</location>
<location l-identity="45"> 2.3</location>
</thing>
<thing identity="3">
<location l-identity="2"> 1.6</location>
<location l-identity="4"> 2.5</location>
<location l-identity="17"> 2.4</location>
</thing>
</surfacedata>
<surfacedata time-begin="2" time-end="3">
<thing identity="1">
<location l-identity="78"> 3.2</location>
</thing>
<thing identity="5">
<location l-identity="2"> 1.7</location>
<location l-identity="7"> 4.5</location>
</thing>
</surfacedata>
</surfacehistory>
</zone>
</file>
The ideal output from this example would be a Pandas Dataframe similar to this :
time-begin time-end thing location surface
1 2 1 2 1,256
45 2,3
3 2 1,6
4 2,5
17 2,4
2 3 1 78 3,2
5 2 1,7
7 4,5
Here is the current code I wrote :
import pandas as pd
from bs4 import BeautifulSoup
import lxml
datas = open("stack_example.xml","r")
doc = BeautifulSoup(datas.read(), "lxml")
doc.unit.get("surface")
l = []
temp={}
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
time_begin = surfacedata.get("time-begin")
time_end = surfacedata.get("time-end")
temp["time_begin"]=[time_begin]
temp["time_end"]=[time_end]
things = surfacedata.find_all("thing", recursive=False)
for thing in thingss:
identity = thing.get("identity")
temp["thing"]=[identity]
locations = thing.find_all("location", recursive=False)
for location in locations:
l_identity = location.get("l-identity")
surface = location.getText()
temp["surface"]=[surface]
temp["location"]=[l_identity]
l.append(pd.DataFrame(temp))
res = pd.concat(l, ignore_index=True).fillna(0.)
It only gets the last location of all things because the location gets refreshed in the loop, but I'm not sure on how to achieve desired result from this point.
CodePudding user response:
You could set the index based on columns:
df.set_index(df.columns.to_list())
Take a look at some issues in your code:
There is a typo in
for thing in thingss:
Instead of
temp["location"]=[identity]
settemp["thing"]=[identity]
Change position to get correct order of columns
temp["surface"]=[surface] temp["location"]=[l_identity]
Example
import pandas as pd
from bs4 import BeautifulSoup
xml = '''
<file filename="stack_example" created="today">
<unit time="day" volume="cm3" surface="cm2"/>
<zone z_id="10">
<surfacehistory type="calculation">
<surfacedata time-begin="1" time-end="2">
<thing identity="1">
<location l-identity="2"> 1.256</location>
<location l-identity="45"> 2.3</location>
</thing>
<thing identity="3">
<location l-identity="2"> 1.6</location>
<location l-identity="4"> 2.5</location>
<location l-identity="17"> 2.4</location>
</thing>
</surfacedata>
<surfacedata time-begin="2" time-end="3">
<thing identity="1">
<location l-identity="78"> 3.2</location>
</thing>
<thing identity="5">
<location l-identity="2"> 1.7</location>
<location l-identity="7"> 4.5</location>
</thing>
</surfacedata>
</surfacehistory>
</zone>
</file>
'''
doc = BeautifulSoup(xml, "lxml")
l = []
temp = {}
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
time_begin = surfacedata.get("time-begin")
time_end = surfacedata.get("time-end")
temp["time_begin"]=[time_begin]
temp["time_end"]=[time_end]
things = surfacedata.find_all("thing", recursive=False)
for thing in things:
identity = thing.get("identity")
temp["thing"]=[identity]
locations = thing.find_all("location", recursive=False)
for location in locations:
l_identity = location.get("l-identity")
surface = location.getText()
temp["location"]=[l_identity]
temp["surface"]=[surface]
l.append(pd.DataFrame(temp))
df = pd.concat(l, ignore_index=True).fillna(0.)
df.set_index(df.columns.to_list())
CodePudding user response:
You nearly have it. Just slightly altered the logic a bit. But looks right.
datas = '''<file filename="stack_example" created="today">
<unit time="day" volume="cm3" surface="cm2"/>
<zone z_id="10">
<surfacehistory type="calculation">
<surfacedata time-begin="1" time-end="2">
<thing identity="1">
<location l-identity="2"> 1.256</location>
<location l-identity="45"> 2.3</location>
</thing>
<thing identity="3">
<location l-identity="2"> 1.6</location>
<location l-identity="4"> 2.5</location>
<location l-identity="17"> 2.4</location>
</thing>
</surfacedata>
<surfacedata time-begin="2" time-end="3">
<thing identity="1">
<location l-identity="78"> 3.2</location>
</thing>
<thing identity="5">
<location l-identity="2"> 1.7</location>
<location l-identity="7"> 4.5</location>
</thing>
</surfacedata>
</surfacehistory>
</zone>
</file>'''
Code:
import pandas as pd
from bs4 import BeautifulSoup
import lxml
#datas = open("stack_example.xml","r")
#doc = BeautifulSoup(datas.read(), "lxml")
doc = BeautifulSoup(datas, "lxml")
doc.unit.get("surface")
rows = []
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
row = {}
time_begin = surfacedata.get("time-begin")
time_end = surfacedata.get("time-end")
row = {'time-begin':time_begin,
'time-end':time_end}
things = surfacedata.find_all("thing", recursive=False)
for thing in things:
identity = thing.get("identity")
row.update({'thing':identity})
locations = thing.find_all("location", recursive=False)
for location in locations:
locationStr = location['l-identity']
surface = location.text.strip()
row.update({'location':locationStr,
'surface':surface})
row_copy = row.copy()
rows.append(row_copy)
df = pd.DataFrame(rows)
Output:
print(df)
time-begin time-end thing location surface
0 1 2 1 2 1.256
1 1 2 1 45 2.3
2 1 2 3 2 1.6
3 1 2 3 4 2.5
4 1 2 3 17 2.4
5 2 3 1 78 3.2
6 2 3 5 2 1.7
7 2 3 5 7 4.5
CodePudding user response:
Since you are using lxml
, consider XSLT, the special-purpose language designed to transform XML files, and the recent IO module, pandas.read_xml
introduced in v1.3. While this method works on flat XML files by default, its stylesheet
argument allows you to transform the raw input to flatter format for data frame migration.
Specifically, XSLT parses down to the <location>
nodes and pull parent and ancestor attributes as siblings for flatter structure with repeated nodes.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/file">
<data>
<xsl:apply-templates select="descendant::location"/>
</data>
</xsl:template>
<xsl:template match="location">
<row>
<time_begin><xsl:value-of select="ancestor::surfacedata/@time-begin"/></time_begin>
<time_end><xsl:value-of select="ancestor::surfacedata/@time-end"/></time_end>
<thing><xsl:value-of select="parent::thing/@identity"/></thing>
<location><xsl:value-of select="@l-identity"/></location>
<surface><xsl:value-of select="normalize-space()"/></surface>
</row>
</xsl:template>
</xsl:stylesheet>
Python (no other packages needed)
import pandas as pd
xml_file = "input.xml"
xsl_file = "style.xsl"
surface_data_df = pd.read_xml(xml_file, stylesheet=xsl_file)
surface_data_df
# time_begin time_end thing location surface
# 0 1 2 1 2 1.256
# 1 1 2 1 45 2.300
# 2 1 2 3 2 1.600
# 3 1 2 3 4 2.500
# 4 1 2 3 17 2.400
# 5 2 3 1 78 3.200
# 6 2 3 5 2 1.700
# 7 2 3 5 7 4.500