I am trying to use the R package, in order to get into a dataframe the values of the 'nonDerivativeTable' node of the following US government table, reporting insider trading data.
https://www.sec.gov/Archives/edgar/data/1274494/000112760222021374/form4.xml
I have tried both the XML and the XML2 packages and followed the example given here: R: convert XML data to data frame
But I had no luck, probably because the two xml pages are very different,
I would really appreciate any help more specifically to my example. Thanks in advance,
CodePudding user response:
There are two problems with this URL:
- It does not accept requests from R directly. We have to simulate that we use a browser to access it. We can do this by using the right request headers:
library(xml2)
library(httr2)
resp <- request("https://www.sec.gov/Archives/edgar/data/1274494/000112760222021374/form4.xml") |>
req_headers(
"Accept-Encoding" = "gzip, deflate, sdch",
"Accept-Language" = "en-US,en;q=0.8",
"Upgrade-Insecure-Requests" = "1",
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
"Accept" = "text/html,application/xhtml xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Cache-Control" = "max-age=0",
"Connection" = "keep-alive"
) |>
req_perform() |>
resp_body_xml()
Now we have received an xml object. We can convert this into a list:
resp_l <- resp |>
as_list() |>
unlist(recursive = FALSE)
- The data in the xml is not rectangular. If it was, we could simply convert this to a data.frame but some entries are longer than others. Some are nested etc. The only thing we can do is look through the object (e.g., with
´View(resp_l)
) and look for the information you want. Then you can piece together a data.frame:
data.frame(
issuerName = resp_l[["ownershipDocument.issuer"]][["issuerName"]][[1]],
rptOwnerStreet1 = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerStreet1"]][[1]],
rptOwnerStreet2 = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerStreet2"]][[1]],
rptOwnerCity = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerCity"]][[1]],
rptOwnerState = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerState"]][[1]],
rptOwnerZipCode = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerZipCode"]][[1]]
)
#> issuerName rptOwnerStreet1 rptOwnerStreet2
#> 1 FIRST SOLAR, INC. C/O FIRST SOLAR, INC. 350 WEST WASHINGTON STREET, SUITE 600
#> rptOwnerCity rptOwnerState rptOwnerZipCode
#> 1 TEMPE AZ 85281-1244
Created on 2022-08-31 by the reprex package (v2.0.1)