Home > Mobile >  How to parse SEC cal.xml files correctly with pd.read_xml?
How to parse SEC cal.xml files correctly with pd.read_xml?

Time:10-31

I have tried since a couple of months to standardize SEC filings. However, I have realized that the us-gaap tags have a different meaning per year per company.

Therefore, my goal is now to extract from the cal.xml files for each us-gaap sub-term the parent-term.

Example for the cal.xml file of the AAPL filing 2011-09-24: The parent-term of the sub-term "AccountsPayableCurrent" seems to be "LiabilitiesCurrent".

I would like to use the pandas.read_xml function. df = pd.read_xml('https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/aapl-20110924_cal.xml')

However, the resulting df doesn't have a form where I can extract such an information. Does somebody know how to do it automatically for each ca.xml I wish it to do?

I have read in the documentation of pd.read_xml, that it can take a stylesheet (XSLT) as an argument. Is it somehow possible to create such an XSLT from the .xml or the related .xsd?

Thank you guys in advance. Please let me know how I can improve my question.

CodePudding user response:

Simply specify a needed xpath to the section of nodes you intend to parse. Per docs, the default is first level ./*:

import pandas as pd
import requests

url = (
    "https://www.sec.gov/Archives/edgar/data/320193/"
    "000119312511282113/aapl-20110924_cal.xml"
)
hdr = {
    "user-agent": 
    (
       "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 "
       "Mobile Safari/537.36"
    )
}

r = requests.get(url, headers=hdr)

# roleRef NODES
roleRef_df = pd.read_xml(
    r.text,
    xpath = "//doc:roleRef",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"}
)

# calculationLink NODES
calculationLink_df = pd.read_xml(
    r.text,
    xpath = "//doc:calculationLink",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"}
)

# loc NODES
loc_df = pd.read_xml(
    r.text,
    xpath = "//doc:calculationLink/doc:loc",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"}
)

# calculationArc NODES
calculationArc_df = pd.read_xml(
    r.text,
    xpath = "//doc:calculationLink/doc:calculationArc",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"}
)

Should you need more extensive parsing such as retrieving attributes of the parent, calculationLink, with its children loc or calculationArc, then consider XSLT.

xsl = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:doc="http://www.xbrl.org/2003/linkbase">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/*">
     <xsl:copy>
       <xsl:apply-templates select="descendant::doc:loc"/>
       <xsl:apply-templates select="descendant::doc:calculationArc"/>
     </xsl:copy>
    </xsl:template>
    
    <xsl:template match="doc:loc|doc:calculationArc">
     <xsl:copy>
       <xsl:copy-of select="ancestor::doc:calculationLink/@*"/>
       <xsl:copy-of select="@*"/>
     </xsl:copy>
    </xsl:template>
</xsl:stylesheet>'''

calculationLink_loc_df = pd.read_xml(
    r.text,
    xpath = "//doc:loc",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"},
    stylesheet = xsl
)

calculationLink_arc_df = pd.read_xml(
    r.text,
    xpath = "//doc:calculationArc",
    namespaces = {"doc": "http://www.xbrl.org/2003/linkbase"},
    stylesheet = xsl
)

Output

calculationLink_loc_df.head()
#       type                                               role                                               href                                              label
# 0  locator  http://www.apple.com/taxonomy/role/StatementOf...  http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap...                 us-gaap_CostOfGoodsAndServicesSold
# 1  locator  http://www.apple.com/taxonomy/role/StatementOf...  http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap...                                us-gaap_GrossProfit
# 2  locator  http://www.apple.com/taxonomy/role/StatementOf...  http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap...  us-gaap_IncomeLossFromContinuingOperationsBefo...
# 3  locator  http://www.apple.com/taxonomy/role/StatementOf...  http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap...                    us-gaap_IncomeTaxExpenseBenefit
# 4  locator  http://www.apple.com/taxonomy/role/StatementOf...  http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap...                              us-gaap_NetIncomeLoss


calculationLink_arc_df.head()

#   type                                               role                                          arcrole                                               from                                                 to  order  weight  priority       use
# 0  arc  http://www.apple.com/taxonomy/role/StatementOf...  http://www.xbrl.org/2003/arcrole/summation-item                                us-gaap_GrossProfit                            us-gaap_SalesRevenueNet   1.01     1.0         2  optional
# 1  arc  http://www.apple.com/taxonomy/role/StatementOf...  http://www.xbrl.org/2003/arcrole/summation-item                                us-gaap_GrossProfit                 us-gaap_CostOfGoodsAndServicesSold   1.02    -1.0         2  optional
# 2  arc  http://www.apple.com/taxonomy/role/StatementOf...  http://www.xbrl.org/2003/arcrole/summation-item  us-gaap_IncomeLossFromContinuingOperationsBefo...                        us-gaap_OperatingIncomeLoss   1.07     1.0         2  optional
# 3  arc  http://www.apple.com/taxonomy/role/StatementOf...  http://www.xbrl.org/2003/arcrole/summation-item  us-gaap_IncomeLossFromContinuingOperationsBefo...                  us-gaap_NonoperatingIncomeExpense   1.08     1.0         2  optional
# 4  arc  http://www.apple.com/taxonomy/role/StatementOf...  http://www.xbrl.org/2003/arcrole/summation-item                              us-gaap_NetIncomeLoss  us-gaap_IncomeLossFromContinuingOperationsBefo...   1.09     1.0         2  optional
  • Related