Python XML documents-CodePudding

I'm a bit new to XML and python. Below is a cut down version of a large XML file I'm trying to bring into python to eventually write into SQL Server db.

<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <PublicationSource value="TEST123" />
    <PublicationDate value="2022-05-23" />
    <PublicationSeqNum value="1659" />
    <FileCreationDateTime value="2022-05-23T22:14:47" />
    <RecordCount value="287654" />
    <ContentDescription value="FullFile_20220523" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
      <concept id="RC1" code="1" displayName="World2" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
      <concept id="R2D2a" code="180" displayName="Fantastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>

I've tried with lxml, pandas.read_xml, xml.etree and I'm not able to understand how to get what I want.

Ideally I'd like to pull in Manifest into a dataframe ready to to send to SQL (pd.to_sql()). I would do the same with CodeSystems as well, but separately. (there are other sections but I cut them off to shorten)

For example, using pandas to read in, I can only get a column with the values in. But I would like to either have the tag (Version, PublicationType, PublicationSource etc) in a column by the side of the value, or have them as the column headers and the values pivoted across the row instead.

dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder 'Data_Full_20220523.xml', 
    xpath='//Manifest/*', 
    attrs_only=True ,
    )
df_bulk.head()

This is the output I get:

inx	value
0	2-0-0
1	Full
2	TEST123
3	2022-05-23
4	1659
5	2022-05-23T22:14:47
6	287654
7	FullFile_20220523

Ideally I would like:

inx	value
Version	2-0-0
PublicationType	Full
PublicationSource	TEST123
PublicationDate	2022-05-23
PublicationSeqNum	1659
FileCreationDateTime	2022-05-23T22:14:47
FileCreationDateTime	287654
ContentDescription	FullFile_20220523

The eagle eyed among you will notice I've left out PrimaryRoleScope. I would ideally like to treat this separately in it's own dataframe as well. But I am unsure how to exclude it when pulling in the rest of the Manifest section.

Many thanks if you've read this far, even more thanks for any help.

CodePudding user response：

One possibility is using the stylesheet parameter to transform the XML data internally with XSLT before processing it.

So your code could look like this:

dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder 'Data_Full_20220523.xml', 
    stylesheet='transform.xslt',
    xpath='/Root/Item', 
    attrs_only=True ,
    )
print(df_bulk.head(10))

The stylesheet(transform.xml) to be passed to read_xml could be (lxml is required)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>    
       
    <xsl:template match="/">
        <Root><xsl:apply-templates /></Root>
    </xsl:template>

    <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
        <Item name="{name()}" value="{@value}" />
    </xsl:template>
    
</xsl:stylesheet>

In this example a new XML like the following is created. It is intermediate XML and not shown, but the xpath= parameter above has to be set accordingly.

<Root>
    <Item name="Version" value="2-0-0"/>
    <Item name="PublicationType" value="Full"/>
    <Item name="PublicationSource" value="TEST123"/>
    <Item name="PublicationDate" value="2022-05-23"/>
    <Item name="PublicationSeqNum" value="1659"/>
    <Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
    <Item name="RecordCount" value="287654"/>
    <Item name="ContentDescription" value="FullFile_20220523"/>
</Root>

And the final output is

                   name                value
0               Version                2-0-0
1       PublicationType                 Full
2     PublicationSource              TEST123
3       PublicationDate           2022-05-23
4     PublicationSeqNum                 1659
5  FileCreationDateTime  2022-05-23T22:14:47
6           RecordCount               287654
7    ContentDescription    FullFile_20220523

The above approach uses only attributes, but you could also create an element structure with the XSLT if you prefer that. In this case change one template to

<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item>
        <name><xsl:value-of select="name()" /></name>
        <value><xsl:value-of select="@value" /></value>
   </Item>
</xsl:template>

and your python code to

dataFolder = '/Some/directory'
df_bulk = pd.read_xml(
    dataFolder 'Data_Full_20220523.xml', 
    stylesheet='transform.xslt',
    xpath='/Root/Item', 
    )
print(df_bulk.head(10))

The output is the same.