I have an issue with transforming XML to DataFrame. I have the following sample XML:
<Fruits>
<Fruit ReferenceDate="2022-09-22"
FruitName="Apple">
<Identifier FruitIdentifier="111"
FruitBrand="GoldenApple"/>
<FruitInformation Country="Turkey"
Colour="Green"/>
<CompanyInformation CompanyName="GlobalFruits"
Location="USA"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 1.</FreeText1>
<FreeText2>Sample sentence 2.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
<Fruit ReferenceDate="2022-09-22"
FruitName="Orange">
<Identifier FruitIdentifier="222"
FruitBrand="BestOrange"/>
<FruitInformation Country="Egypt"
Colour="Orange"/>
<CompanyInformation CompanyName="FreshFood"
Location="UK"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 3.</FreeText1>
<FreeText2>Sample sentence 4.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
</Fruits>
I want to transform it to the DataFrame. The final table should look like the table on the image below:
I'm sorry in advance if it's a duplicate question, but I did not find the solution which fits me.
So far I have the following code:
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("fruits.xml")
xroot = xtree.getroot()
df_cols = ["ReferenceDate", "FruitName", "FruitIdentifier",
"FruitBrand", "Country", "Colour", "CompanyName",
"Location", "CountryId", "LanguageId"]
rows = []
for node in xroot.iter():
ReferenceDate = node.attrib.get("ReferenceDate")
FruitName = node.attrib.get("FruitName")
FruitIdentifier = node.attrib.get("FruitIdentifier")
FruitBrand = node.attrib.get("FruitBrand")
Country = node.attrib.get("Country")
Colour = node.attrib.get("Colour")
CompanyName = node.attrib.get("CompanyName")
Location = node.attrib.get("Location")
CountryId = node.attrib.get("CountryId")
LanguageId = node.attrib.get("LanguageId")
rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
"FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
"Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
"CountryId": CountryId, "LanguageId": LanguageId})
out_df = pd.DataFrame(rows, columns = df_cols)
I have two main issues:
- Cannot get texts(FreeText1 and FreeText2);
- Each group of attributes within subquery has its own row.
CodePudding user response:
A bit shorter and more universal implementation:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.parse("fruits.xml")
rows = []
for fruit in root: # <Fruit> is child node of root node <Fruits>
rows.append(row := {})
for node in fruit.iter(): # recursive iterator
if node.attrib: # has attributes
row |= node.attrib
elif text := node.text.strip(): # has some text
row[node.tag] = text
df = pd.DataFrame(rows)
This implementation uses next logic:
- If current node has attributes we add them to over
row
dictionary; - Else if current node has no attributes but has some non-whitespace text we write this text to
row
using tag name as key.
If your <Fruit>
node could possibly has child nodes where both attributes and text should be included just replace elif text := node.text.strip()
with if node.text and (text := node.text.strip())
.
You can help my country, check my profile info.
CodePudding user response:
The below works
import pandas as pd
import xml.etree.ElementTree as ET
xml = '''<Fruits>
<Fruit ReferenceDate="2022-09-22"
FruitName="Apple">
<Identifier FruitIdentifier="111"
FruitBrand="GoldenApple"/>
<FruitInformation Country="Turkey"
Colour="Green"/>
<CompanyInformation CompanyName="GlobalFruits"
Location="USA"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 1.</FreeText1>
<FreeText2>Sample sentence 2.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
<Fruit ReferenceDate="2022-09-22"
FruitName="Orange">
<Identifier FruitIdentifier="222"
FruitBrand="BestOrange"/>
<FruitInformation Country="Egypt"
Colour="Orange"/>
<CompanyInformation CompanyName="FreshFood"
Location="UK"/>
<Languages>
<LanguageDependent CountryId="GB"
LanguageId="EN">
<FreeText1>Sample sentence 3.</FreeText1>
<FreeText2>Sample sentence 4.</FreeText2>
</LanguageDependent>
</Languages>
</Fruit>
</Fruits>'''
root = ET.fromstring(xml)
rows = []
for f in root.findall('.//Fruit'):
ReferenceDate = f.attrib.get("ReferenceDate")
FruitName = f.attrib.get("FruitName")
FruitIdentifier = f.find("Identifier").attrib.get("FruitIdentifier")
FruitBrand = f.find("Identifier").attrib.get("FruitBrand")
Country = f.find("FruitInformation").attrib.get("Country")
Colour = f.find("FruitInformation").attrib.get("Colour")
CompanyName = f.find("CompanyInformation").attrib.get("CompanyName")
Location = f.find("CompanyInformation").attrib.get("Location")
CountryId = f.find(".//LanguageDependent").attrib.get("CountryId")
LanguageId = f.find(".//LanguageDependent").attrib.get("LanguageId")
FreeText1 = f.find(".//LanguageDependent/FreeText1").text
FreeText2 = f.find(".//LanguageDependent/FreeText2").text
rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
"FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
"Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
"CountryId": CountryId, "LanguageId": LanguageId, "FreeText1": FreeText1,
"FreeText2": FreeText2})
df = pd.DataFrame(rows)
print(df)
output
ReferenceDate FruitName ... FreeText1 FreeText2
0 2022-09-22 Apple ... Sample sentence 1. Sample sentence 2.
1 2022-09-22 Orange ... Sample sentence 3. Sample sentence 4.
[2 rows x 12 columns]
CodePudding user response:
While your XML is not shallow enough for a single pandas.read_xml
, the needed data is consistent enough for multiple calls that can be horizontally merged:
fruits_df = (
pd.concat([
pd.read_xml("fruits.xml", xpath=".//Fruit", attrs_only = True, parser="etree"),
pd.read_xml("fruits.xml", xpath=".//Identifier", attrs_only = True, parser="etree"),
pd.read_xml("fruits.xml", xpath=".//FruitInformation", attrs_only = True, parser="etree"),
pd.read_xml("fruits.xml", xpath=".//CompanyInformation", attrs_only = True, parser="etree"),
pd.read_xml("fruits.xml", xpath=".//LanguageDependent", parser="etree")
],
axis=1
)
)
Alternatively, via list comprehension:
fruits_df = (
pd.concat(
[
pd.read_xml("fruits.xml", xpath=f".//{elem}", parser="etree")
for elem in
["Fruit", "Identifier", "FruitInformation", "CompanyInformation", "LanguageDependent"]
],
axis=1
).dropna(axis="columns")
)
Output
ReferenceDate FruitName FruitIdentifier FruitBrand Country Colour CompanyName Location CountryId LanguageId FreeText1 FreeText2
0 2022-09-22 Apple 111 GoldenApple Turkey Green GlobalFruits USA GB EN Sample sentence 1. Sample sentence 2.
1 2022-09-22 Orange 222 BestOrange Egypt Orange FreshFood UK GB EN Sample sentence 3. Sample sentence 4.
CodePudding user response:
Try following powershell script :
using assembly System
using assembly System.Xml.Linq
using assembly System.Data
$dt = New-Object System.Data.DataTable("Fruits")
$dt.Columns.Add("ReferenceDate", [DateTime]) | Out-Null
$dt.Columns.Add("FruitName", [string]) | Out-Null
$dt.Columns.Add("FruitIdentifier", [string]) | Out-Null
$dt.Columns.Add("FruitBrand", [string]) | Out-Null
$dt.Columns.Add("Country", [string]) | Out-Null
$dt.Columns.Add("Colour", [string]) | Out-Null
$dt.Columns.Add("CompanyName", [string]) | Out-Null
$dt.Columns.Add("Location", [string]) | Out-Null
$dt.Columns.Add("CountryId", [string]) | Out-Null
$dt.Columns.Add("LanguageId", [string]) | Out-Null
$dt.Columns.Add("FreeText1", [string]) | Out-Null
$dt.Columns.Add("FreeText2", [string]) | Out-Null
$Filename = "c:\temp\test.xml"
$xDoc = [System.Xml.Linq.XDocument]::Load($Filename)
$fruits = $xDoc.Descendants("Fruit")
foreach($fruit in $fruits)
{
$row = $dt.NewRow()
$dt.Rows.Add($row)
$date = [DateTime]$fruit.Attribute("ReferenceDate")
$row.ReferenceDate = $date
$name = $fruit.Attribute("FruitName").Value
$row.FruitName = $name
$identifier = $fruit.Element("Identifier")
$fruitIdentifier = $identifier.Attribute("FruitIdentifier").Value
$row.FruitIdentifier = $fruitIdentifier
$fruitBrand = $identifier.Attribute("FruitBrand").Value
$row.FruitBrand = $fruitBrand
$fruitInformation = $fruit.Element("FruitInformation")
$country = $fruitInformation.Attribute("Country").Value
$row.country = $country
$colour = $fruitInformation.Attribute("Colour").Value
$row.Colour = $colour
$companyInformation = $fruit.Element("CompanyInformation")
$companyName = $companyInformation.Attribute("CompanyName").Value
$row.CompanyName = $companyName
$location = $companyInformation.Attribute("Location").Value
$row.Location = $location
$languageDependent = $fruit.Descendants("LanguageDependent")
$countryId = $languageDependent.Attribute("CountryId").Value
$row.CountryId = $countryId
$languageId = $languageDependent.Attribute("LanguageId").Value
$row.languageId = $languageId
$freeText1 = $languageDependent.Element("FreeText1").Value
$row.FreeText1 = $freeText1
$freeText2 = $languageDependent.Element("FreeText2").Value
$row.FreeText2 = $freeText2
}
# convert the datatable to an array of strings
$table = $dt | Format-Table | Out-String -Stream
foreach ($row in $table) {
Write-Host $row
}