Downloading and reading XML SSDMX/ZIP file into R-CodePudding

I am in very unfamiliar terrain with downloading this file set from the Federal Reserve's website (I was hoping for an API but it doesn't appear to exist).

I am not sure what this Fed link format really is and how I would read the data into R.

I tried solutions here and here but I get different errors and not sure where to go. When I try something like this solution:

library(XML)

tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://www.federalreserve.gov/datadownload/Output.aspx?rel=Z1&filetype=zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

## Parse the first file
doc <- xmlInternalTreeParse(xml_files[1])

I get Warning message: In unzip(tf, exdir = tdir) : zip file is corrupt

So basically, I would like to be able to get the data without having to first go to the website, download the excel version, and then read the excel version into R.

I am assuming there is a way to just reference this link https://www.federalreserve.gov/datadownload/Output.aspx?rel=Z1&filetype=zip to read the data into R without the manual steps associated with first downloading in excel.

The link for different download options for this data is here: https://www.federalreserve.gov/datadownload/Choose.aspx?rel=Z1

Thanks!

CodePudding user response：

Interestingly, I did not encounter the unzip warning on Linux but did on Windows and one notable difference of operating systems is character encoding.

Therefore, consider the non-default binary mode (i.e., mode = "wb") of download.file which does not raise a warning on Windows. This also makes sense for XML files which usually carry non-ASCII encoding such as UTF-8.

## Download the zip file 
url <- "https://www.federalreserve.gov/datadownload/Output.aspx?rel=Z1&filetype=zip"
download.file(url, tf, mode = "wb")

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

In review, the Z1_data XML looks to be fairly flat and attribute-centric as shown in the head of the document with data observations nested under each series:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>  
<message:MessageGroup xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message"  xmlns:common="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/common"  xmlns:frb="http://www.federalreserve.gov/structure/compact/common"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message SDMXMessage.xsd http://www.federalreserve.gov/structure/compact/common frb_common.xsd"><message:Header xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message">
    <message:ID>OTHER</message:ID>
    <message:Test>false</message:Test>
    <message:Name>Z.1 Statistical Release</message:Name>
    <message:Prepared>2022-06-08T23:46:08</message:Prepared>
    <message:Sender id="FRB">
      <message:Name>Federal Reserve Board</message:Name>
      <message:Contact>
        <message:Name>Financial Accounts of the United States Z.1 Coordinator</message:Name>
        <message:Telephone>(202)-452-3000</message:Telephone>
      </message:Contact>
    </message:Sender>
  </message:Header>
<frb:DataSet id="Z1" xmlns:kf="http://www.federalreserve.gov/structure/compact/Z1_Z1" xsi:schemaLocation="http://www.federalreserve.gov/structure/compact/Z1_Z1 Z1_Z1.xsd"  > 
<kf:Series CURRENCY="USD" FREQ="203" SERIES_INSTRUMENT="60100" SERIES_NAME="FA086010005.A" SERIES_PREFIX="FA" SERIES_SECTOR="08" SERIES_TYPE="5" UNIT="Currency" UNIT_MULT="1000000"  > 
<frb:Annotations>
<common:Annotation>
<common:AnnotationType>Short Description</common:AnnotationType>
<common:AnnotationText>U.S. national income </common:AnnotationText>
</common:Annotation>
<common:Annotation>
<common:AnnotationType>Long Description</common:AnnotationType>
<common:AnnotationText>U.S. national income </common:AnnotationText>
</common:Annotation>
</frb:Annotations>
<frb:Obs OBS_STATUS="A" OBS_VALUE="201261" TIME_PERIOD="1946-12-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="218742" TIME_PERIOD="1947-12-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="244849" TIME_PERIOD="1948-12-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="239729" TIME_PERIOD="1949-12-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="266613" TIME_PERIOD="1950-12-31" />
...

Therefore, consider using the undocumented, xmlAttrsToDataFrame, and assign namespace prefixes for XPath parsing, to build data frames of all series and select observations:

## Parse the first file
doc <- xmlInternalTreeParse(xml_files[1])

nmsp <- c(
  kf="http://www.federalreserve.gov/structure/compact/Z1_Z1",
  frb="http://www.federalreserve.gov/structure/compact/common"
)

# RETURN ALL SERIES IN DATA
data_series <- XML:::xmlAttrsToDataFrame(
  getNodeSet(doc, path="//kf:Series", namespaces=nmsp)
)
str(data_series)
# 'data.frame': 23039 obs. of  9 variables:
#  $ CURRENCY         : chr  "USD" "USD" "USD" "USD" ...
#  $ FREQ             : chr  "203" "162" "203" "162" ...
#  $ SERIES_INSTRUMENT: chr  "60100" "60100" "61300" "61300" ...
#  $ SERIES_NAME      : chr  "FA086010005.A" "FA086010005.Q" "FA086130003.A" "FA086130003.Q" ...
#  $ SERIES_PREFIX    : chr  "FA" "FA" "FA" "FA" ...
#  $ SERIES_SECTOR    : chr  "08" "08" "08" "08" ...
#  $ SERIES_TYPE      : chr  "5" "5" "3" "3" ...
#  $ UNIT             : chr  "Currency" "Currency" "Currency" "Currency" ...
#  $ UNIT_MULT        : chr  "1000000" "1000000" "1000000" "1000000" ...

# RETURN OBSERVATIONS OF A SINGLE SERIES
FA086010005.A <- XML:::xmlAttrsToDataFrame(
  getNodeSet(doc, path="//kf:Series[@SERIES_NAME='FA086010005.A']/frb:Obs", namespaces=nmsp)
)

head(FA086010005.A, 10)
#    OBS_STATUS OBS_VALUE TIME_PERIOD
# 1           A    201261  1946-12-31
# 2           A    218742  1947-12-31
# 3           A    244849  1948-12-31
# 4           A    239729  1949-12-31
# 5           A    266613  1950-12-31
# 6           A    307637  1951-12-31
# 7           A    326112  1952-12-31
# 8           A    343840  1953-12-31
# 9           A    343753  1954-12-31
# 10          A    376903  1955-12-31

tail(FA086010005.A, 10)
#    OBS_STATUS OBS_VALUE TIME_PERIOD
# 67          A  14099598  2012-12-31
# 68          A  14507115  2013-12-31
# 69          A  15228067  2014-12-31
# 70          A  15749523  2015-12-31
# 71          A  16033408  2016-12-31
# 72          A  16774920  2017-12-31
# 73          A  17673340  2018-12-31
# 74          A  18273065  2019-12-31
# 75          A  17710721  2020-12-31
# 76          A  19937976  2021-12-31

CodePudding user response：

I suspect this should work:

tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://www.federalreserve.gov/datadownload/Output.aspx?rel=Z1&filetype=zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)


library(rvest)
doc <- read_html(xml_files[1])

(I'm trying it now, since the file Z1_data.xml is 526mb it takes some time to parse, but I can't see any reason it shouldn't work)