I'm a bit new to XML and python. Below is a cut down version of a large XML file I'm trying to bring into python to eventually write into SQL Server db.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried with lxml, pandas.read_xml, xml.etree and I'm not able to understand how to get what I want.
Ideally I'd like to pull in Manifest into a dataframe ready to to send to SQL (pd.to_sql()). I would do the same with CodeSystems as well, but separately. (there are other sections but I cut them off to shorten)
For example, using pandas to read in, I can only get a column with the values in. But I would like to either have the tag (Version, PublicationType, PublicationSource etc) in a column by the side of the value, or have them as the column headers and the values pivoted across the row instead.
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
dataFolder 'Data_Full_20220523.xml',
xpath='//Manifest/*',
attrs_only=True ,
)
df_bulk.head()
This is the output I get:
inx | value |
---|---|
0 | 2-0-0 |
1 | Full |
2 | TEST123 |
3 | 2022-05-23 |
4 | 1659 |
5 | 2022-05-23T22:14:47 |
6 | 287654 |
7 | FullFile_20220523 |
Ideally I would like:
inx | value |
---|---|
Version | 2-0-0 |
PublicationType | Full |
PublicationSource | TEST123 |
PublicationDate | 2022-05-23 |
PublicationSeqNum | 1659 |
FileCreationDateTime | 2022-05-23T22:14:47 |
FileCreationDateTime | 287654 |
ContentDescription | FullFile_20220523 |
The eagle eyed among you will notice I've left out PrimaryRoleScope. I would ideally like to treat this separately in it's own dataframe as well. But I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, even more thanks for any help.
CodePudding user response:
One possibility is using the stylesheet
parameter to transform the XML data internally with XSLT before processing it.
So your code could look like this:
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
dataFolder 'Data_Full_20220523.xml',
stylesheet='transform.xslt',
xpath='/Root/Item',
attrs_only=True ,
)
print(df_bulk.head(10))
The stylesheet(transform.xml
) to be passed to read_xml
could be (lxml is required)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="/">
<Root><xsl:apply-templates /></Root>
</xsl:template>
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
<Item name="{name()}" value="{@value}" />
</xsl:template>
</xsl:stylesheet>
In this example a new XML like the following is created. It is intermediate XML and not shown, but the xpath=
parameter above has to be set accordingly.
<Root>
<Item name="Version" value="2-0-0"/>
<Item name="PublicationType" value="Full"/>
<Item name="PublicationSource" value="TEST123"/>
<Item name="PublicationDate" value="2022-05-23"/>
<Item name="PublicationSeqNum" value="1659"/>
<Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
<Item name="RecordCount" value="287654"/>
<Item name="ContentDescription" value="FullFile_20220523"/>
</Root>
And the final output is
name value
0 Version 2-0-0
1 PublicationType Full
2 PublicationSource TEST123
3 PublicationDate 2022-05-23
4 PublicationSeqNum 1659
5 FileCreationDateTime 2022-05-23T22:14:47
6 RecordCount 287654
7 ContentDescription FullFile_20220523
The above approach uses only attributes, but you could also create an element structure with the XSLT if you prefer that. In this case change one template to
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
<Item>
<name><xsl:value-of select="name()" /></name>
<value><xsl:value-of select="@value" /></value>
</Item>
</xsl:template>
and your python code to
dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
dataFolder 'Data_Full_20220523.xml',
stylesheet='transform.xslt',
xpath='/Root/Item',
)
print(df_bulk.head(10))
The output is the same.