I have never tried parsing XML markup for data extraction before, and now I've been given the task of extracting data from a statistical source. My naïve approach is to rely solely on pandas.read_xml(), as below:
import io
import pandas as pd
import requests
from zipfile import ZipFile

pd.options.display.max_columns = 8

URL = "https://www.federalreserve.gov/datadownload/Output.aspx?rel=g17&filetype=zip"
r = requests.get(URL)

with ZipFile(io.BytesIO(r.content)) as z:
    # =====================================================================
    # Select the largest file, which contains most of the data
    # =====================================================================
    _map = {_.file_size: _.filename for _ in z.filelist}
    with z.open(_map[max(_map)]) as f:
        df = pd.read_xml(f)

print(df)
The file f passed to pd.read_xml() contains the following markup:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<message:MessageGroup xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message" xmlns:common="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/common" xmlns:frb="http://www.federalreserve.gov/structure/compact/common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message SDMXMessage.xsd http://www.federalreserve.gov/structure/compact/common frb_common.xsd">
<message:Header xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message">
<message:ID>G17</message:ID>
<message:Test>false</message:Test>
<message:Name>G.17 - Industrial Production and Capacity Utilization</message:Name>
<message:Prepared>2022-07-15T00:38:22</message:Prepared>
<message:Sender id="FRB">
<message:Name>Federal Reserve Board</message:Name>
<message:Contact>
<message:Name>Public Affairs</message:Name>
<message:Telephone>(202) 452 - 3204</message:Telephone>
</message:Contact>
</message:Sender>
</message:Header>
<frb:DataSet id="IP_MAJOR_INDUSTRY_GROUPS" xmlns:kf="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS" xsi:schemaLocation="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS G17_IP_MAJOR_INDUSTRY_GROUPS.xsd" >
<kf:Series CURRENCY="NA" FREQ="129" SA="SA" SERIES_CODE="B50001" SERIES_NAME="IP.B50001.S" UNIT="Index:_2017_100" UNIT_MULT="1" >
<frb:Annotations>
<common:Annotation>
<common:AnnotationType>Short Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
<common:Annotation>
<common:AnnotationType>Long Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
</frb:Annotations>
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.8773" TIME_PERIOD="1919-01-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6617" TIME_PERIOD="1919-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.5270" TIME_PERIOD="1919-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6078" TIME_PERIOD="1919-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6348" TIME_PERIOD="1919-05-31" />
... Skip ...
<frb:Obs OBS_STATUS="A" OBS_VALUE="102.9981" TIME_PERIOD="2022-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="103.7286" TIME_PERIOD="2022-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5224" TIME_PERIOD="2022-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5729" TIME_PERIOD="2022-05-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.3648" TIME_PERIOD="2022-06-30" />
</kf:Series>
... Many More Lines Omitted To the End of the Document ...
The desired output for the sample shown above would be:
TIME_PERIOD IP.B50001.S
1919-01-31 4.8773
1919-02-28 4.6617
1919-03-31 4.5270
1919-04-30 4.6078
1919-05-31 4.6348
... Skip ...
2022-02-28 102.9981
2022-03-31 103.7286
2022-04-30 104.5224
2022-05-31 104.5729
2022-06-30 104.3648
For the moment it doesn't return the desired result, as the target data seems to be omitted entirely. The output I get is:
ID Test Name \
0 G17 false G.17 - Industrial Production and Capacity Util...
1 None None None
2 None None None
3 None None None
4 None None None
5 None None None
6 None None None
7 None None None
8 None None None
9 None None None
10 None None None
11 None None None
12 None None None
13 None None None
14 None None None
Prepared Sender id \
0 2022-07-15T00:38:22 NaN None
1 None NaN IP_MAJOR_INDUSTRY_GROUPS
2 None NaN IP_DURABLE_GOODS_DETAIL
3 None NaN IP_NONDURABLE_GOODS_DETAIL
4 None NaN IP_MINING_AND_UTILITY_DETAIL
5 None NaN IP_MARKET_GROUPS
6 None NaN IP_SPECIAL_AGGREGATES
7 None NaN IP_GROSS_VALUE_STAGE_OF_PROCESS_GROUPS
8 None NaN MVA
9 None NaN DIFF
10 None NaN CAP
11 None NaN CAPUTL
12 None NaN GVIP
13 None NaN RIW
14 None NaN KW
schemaLocation Series
0 None NaN
1 http://www.federalreserve.gov/structure/compac... NaN
2 http://www.federalreserve.gov/structure/compac... NaN
3 http://www.federalreserve.gov/structure/compac... NaN
4 http://www.federalreserve.gov/structure/compac... NaN
5 http://www.federalreserve.gov/structure/compac... NaN
6 http://www.federalreserve.gov/structure/compac... NaN
7 http://www.federalreserve.gov/structure/compac... NaN
8 http://www.federalreserve.gov/structure/compac... NaN
9 http://www.federalreserve.gov/structure/compac... NaN
10 http://www.federalreserve.gov/structure/compac... NaN
11 http://www.federalreserve.gov/structure/compac... NaN
12 http://www.federalreserve.gov/structure/compac... NaN
13 http://www.federalreserve.gov/structure/compac... NaN
14 http://www.federalreserve.gov/structure/compac... NaN
That looks more like a table of contents than the target data itself.
Currently the task breaks down into two parts:
1. What kwargs should I pass to pandas.read_xml() to get more meaningful data?
2. If the more meaningful data is retrieved successfully, the next step would be to use the auxiliary files within the zip archive (say, the file G17_IP_MAJOR_INDUSTRY_GROUPS.xsd) to assign the column names, in case the output of the first step is not very human-readable.
Thank you!
CodePudding user response:
There are a few problems with your sample XML, but assuming it is fixed the way I believe it is supposed to be (it's too long to paste here), read_xml() on its own is probably unable to parse it into your expected output.
Instead, I would go through lxml first and then pass the output of a few XPath expressions (see below) to pandas to create the dataframe:
import pandas as pd
from lxml import etree

# f is the open XML file handle from the zip archive, exactly as in your code
tree = etree.parse(f)
doc = tree.getroot()

# your expected columns:
cols = ["TIME_PERIOD", "IP.B50001.S"]

# the base xpath expression: every Obs element, in any namespace
expr = '//*[local-name()="Obs"]'

rows = []
for r in doc.xpath(expr):
    row = []
    # use more xpath expressions to get to the target attributes
    row.extend([r.xpath('.//@TIME_PERIOD')[0], r.xpath('.//@OBS_VALUE')[0]])
    rows.append(row)

frdf = pd.DataFrame(rows, columns=cols)
frdf
Output:
TIME_PERIOD IP.B50001.S
0 1919-01-31 4.8773
1 1919-02-28 4.6617
2 1919-03-31 4.5270
etc.
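Two follow-up notes. First, the expression above matches every Obs element in the document, so on the full download the observations of all series end up concatenated into one frame. If you want one column per series, keyed by the SERIES_NAME attribute as in your desired output, a rough sketch along the same lines could look like this (untested against the full file; the pd.to_numeric call is an assumption that any non-numeric placeholder values should become NaN):

import pandas as pd
from lxml import etree

tree = etree.parse(f)   # f: the open XML file from the zip, as in the question
doc = tree.getroot()

series_cols = {}
# every Series element, regardless of its namespace prefix
for s in doc.xpath('//*[local-name()="Series"]'):
    name = s.get("SERIES_NAME")                      # e.g. "IP.B50001.S"
    obs = s.xpath('./*[local-name()="Obs"]')
    idx = [o.get("TIME_PERIOD") for o in obs]
    vals = [o.get("OBS_VALUE") for o in obs]
    # coerce non-numeric placeholders (if any) to NaN
    series_cols[name] = pd.to_numeric(pd.Series(vals, index=idx), errors="coerce")

wide = pd.DataFrame(series_cols)    # columns aligned on TIME_PERIOD
wide.index.name = "TIME_PERIOD"
print(wide["IP.B50001.S"])

Second, to answer your first question directly: read_xml() itself accepts an xpath expression and a namespaces mapping, so you can also point it straight at the frb:Obs elements without going through lxml yourself. A sketch (it still lumps every series' observations together, so the per-series handling above would still be needed):

# frb is the namespace declared on message:MessageGroup in the file
ns = {"frb": "http://www.federalreserve.gov/structure/compact/common"}
with z.open(_map[max(_map)]) as f:
    obs = pd.read_xml(f, xpath=".//frb:Obs", namespaces=ns)
print(obs[["TIME_PERIOD", "OBS_VALUE"]])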