I have been trying for a couple of months to standardize SEC filings. However, I have realized that the meaning of the us-gaap tags differs per year and per company.
Therefore, my goal is now to extract the parent term of each us-gaap sub-term from the cal.xml files.
Example for the cal.xml file of the AAPL filing 2011-09-24: The parent-term of the sub-term "AccountsPayableCurrent" seems to be "LiabilitiesCurrent".
I would like to use the pandas.read_xml function:
df = pd.read_xml('https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/aapl-20110924_cal.xml')
However, the resulting df is not in a form from which I can extract this information. Does anybody know how to do this automatically for every cal.xml file I want to process?
I have read in the documentation of pd.read_xml that it can take a stylesheet (XSLT) as an argument. Is it possible to create such an XSLT from the .xml or the related .xsd?
Thank you guys in advance. Please let me know how I can improve my question.
CodePudding user response:
Simply specify the needed xpath to the section of nodes you intend to parse. Per the docs, the default is the first level, ./*:
from io import BytesIO

import pandas as pd
import requests

url = (
    "https://www.sec.gov/Archives/edgar/data/320193/"
    "000119312511282113/aapl-20110924_cal.xml"
)
# Send a browser-like user agent with the request
hdr = {
    "user-agent": (
        "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 "
        "Mobile Safari/537.36"
    )
}
r = requests.get(url, headers=hdr)

# pandas 2.1+ deprecates passing literal XML strings to read_xml,
# so the downloaded bytes are wrapped in BytesIO for each call.

# roleRef NODES
roleRef_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:roleRef",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
)

# calculationLink NODES
calculationLink_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:calculationLink",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
)

# loc NODES
loc_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:calculationLink/doc:loc",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
)

# calculationArc NODES
calculationArc_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:calculationLink/doc:calculationArc",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
)
Should you need more extensive parsing, such as retrieving the attributes of the parent, calculationLink, together with its children, loc or calculationArc, then consider XSLT.
xsl = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:doc="http://www.xbrl.org/2003/linkbase">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="descendant::doc:loc"/>
            <xsl:apply-templates select="descendant::doc:calculationArc"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="doc:loc|doc:calculationArc">
        <xsl:copy>
            <xsl:copy-of select="ancestor::doc:calculationLink/@*"/>
            <xsl:copy-of select="@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>'''
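from io import BytesIO  # pandas 2.1+ deprecates literal XML strings as input

# loc NODES, each carrying the attributes of its parent calculationLink
calculationLink_loc_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:loc",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
    stylesheet=xsl,
)

# calculationArc NODES, each carrying the attributes of its parent calculationLink
calculationLink_arc_df = pd.read_xml(
    BytesIO(r.content),
    xpath="//doc:calculationArc",
    namespaces={"doc": "http://www.xbrl.org/2003/linkbase"},
    stylesheet=xsl,
)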
Output
calculationLink_loc_df.head()
# type role href label
# 0 locator http://www.apple.com/taxonomy/role/StatementOf... http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap... us-gaap_CostOfGoodsAndServicesSold
# 1 locator http://www.apple.com/taxonomy/role/StatementOf... http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap... us-gaap_GrossProfit
# 2 locator http://www.apple.com/taxonomy/role/StatementOf... http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap... us-gaap_IncomeLossFromContinuingOperationsBefo...
# 3 locator http://www.apple.com/taxonomy/role/StatementOf... http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap... us-gaap_IncomeTaxExpenseBenefit
# 4 locator http://www.apple.com/taxonomy/role/StatementOf... http://xbrl.fasb.org/us-gaap/2011/elts/us-gaap... us-gaap_NetIncomeLoss
calculationLink_arc_df.head()
# type role arcrole from to order weight priority use
# 0 arc http://www.apple.com/taxonomy/role/StatementOf... http://www.xbrl.org/2003/arcrole/summation-item us-gaap_GrossProfit us-gaap_SalesRevenueNet 1.01 1.0 2 optional
# 1 arc http://www.apple.com/taxonomy/role/StatementOf... http://www.xbrl.org/2003/arcrole/summation-item us-gaap_GrossProfit us-gaap_CostOfGoodsAndServicesSold 1.02 -1.0 2 optional
# 2 arc http://www.apple.com/taxonomy/role/StatementOf... http://www.xbrl.org/2003/arcrole/summation-item us-gaap_IncomeLossFromContinuingOperationsBefo... us-gaap_OperatingIncomeLoss 1.07 1.0 2 optional
# 3 arc http://www.apple.com/taxonomy/role/StatementOf... http://www.xbrl.org/2003/arcrole/summation-item us-gaap_IncomeLossFromContinuingOperationsBefo... us-gaap_NonoperatingIncomeExpense 1.08 1.0 2 optional
# 4 arc http://www.apple.com/taxonomy/role/StatementOf... http://www.xbrl.org/2003/arcrole/summation-item us-gaap_NetIncomeLoss us-gaap_IncomeLossFromContinuingOperationsBefo... 1.09 1.0 2 optional
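To get from these arcs to the parent term of each sub-term, note that in a summation-item arc the from attribute names the parent (the total) and the to attribute names the child (the item being summed). Below is a minimal sketch of that lookup using only the standard library's xml.etree.ElementTree rather than pandas. The inline SAMPLE_CAL_XML is a hypothetical, heavily trimmed linkbase written for illustration (its role URI is a placeholder, not the one from the AAPL file), and child_to_parents is a helper name made up here. It also assumes the from/to labels can be used directly as term names, which appears to hold for the labels shown above but may not hold for every filing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed calculation linkbase for illustration only.
# The role URI is a placeholder; the two arcs mirror rows 0 and 1 shown above.
SAMPLE_CAL_XML = """
<linkbase xmlns="http://www.xbrl.org/2003/linkbase"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <calculationLink xlink:type="extended"
                   xlink:role="http://example.com/role/IncomeStatement">
    <calculationArc xlink:type="arc"
        xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item"
        xlink:from="us-gaap_GrossProfit" xlink:to="us-gaap_SalesRevenueNet"
        order="1.01" weight="1.0"/>
    <calculationArc xlink:type="arc"
        xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item"
        xlink:from="us-gaap_GrossProfit"
        xlink:to="us-gaap_CostOfGoodsAndServicesSold"
        order="1.02" weight="-1.0"/>
  </calculationLink>
</linkbase>
"""

NS = {"link": "http://www.xbrl.org/2003/linkbase"}
XLINK = "{http://www.w3.org/1999/xlink}"  # ElementTree's qualified-name prefix


def child_to_parents(cal_xml):
    """Map each child label to the set of parent labels that sum over it."""
    root = ET.fromstring(cal_xml.strip())
    mapping = {}
    for arc in root.iterfind("link:calculationLink/link:calculationArc", NS):
        parent = arc.get(XLINK + "from")
        child = arc.get(XLINK + "to")
        # A term may appear under different parents in different statements,
        # so collect a set instead of a single value.
        mapping.setdefault(child, set()).add(parent)
    return mapping


parents = child_to_parents(SAMPLE_CAL_XML)
print(parents["us-gaap_SalesRevenueNet"])
```

The same lookup can be read off calculationLink_arc_df by treating the to column as the sub-term and the from column as its parent term. Keeping the role column alongside would tell you which statement each relationship comes from.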