Home > Mobile >  Converting a xml page to a data frame
Converting a xml page to a data frame

Time:09-01

I am trying to use the R package, in order to get into a dataframe the values of the 'nonDerivativeTable' node of the following US government table, reporting insider trading data.

https://www.sec.gov/Archives/edgar/data/1274494/000112760222021374/form4.xml

I have tried both the XML and the XML2 packages and followed the example given here: R: convert XML data to data frame

But I had no luck, probably because the two xml pages are very different,

I would really appreciate any help more specifically to my example. Thanks in advance,

CodePudding user response:

There are two problems with this URL:

  1. It does not accept requests from R directly. We have to simulate that we use a browser to access it. We can do this by using the right request headers:
library(xml2)
library(httr2)
resp <- request("https://www.sec.gov/Archives/edgar/data/1274494/000112760222021374/form4.xml") |> 
  req_headers(
    "Accept-Encoding" = "gzip, deflate, sdch",
    "Accept-Language" = "en-US,en;q=0.8",
    "Upgrade-Insecure-Requests" = "1",
    "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "Accept" = "text/html,application/xhtml xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Cache-Control" = "max-age=0",
    "Connection" = "keep-alive"
  ) |> 
  req_perform() |> 
  resp_body_xml()

Now we have received an xml object. We can convert this into a list:

resp_l <- resp |> 
  as_list() |> 
  unlist(recursive = FALSE)
  1. The data in the xml is not rectangular. If it was, we could simply convert this to a data.frame but some entries are longer than others. Some are nested etc. The only thing we can do is look through the object (e.g., with ´View(resp_l)) and look for the information you want. Then you can piece together a data.frame:
data.frame(
  issuerName = resp_l[["ownershipDocument.issuer"]][["issuerName"]][[1]],
  rptOwnerStreet1 = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerStreet1"]][[1]],
  rptOwnerStreet2 = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerStreet2"]][[1]],
  rptOwnerCity = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerCity"]][[1]],
  rptOwnerState = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerState"]][[1]],
  rptOwnerZipCode = resp_l[["ownershipDocument.reportingOwner"]][["reportingOwnerAddress"]][["rptOwnerZipCode"]][[1]]
)
#>          issuerName       rptOwnerStreet1                       rptOwnerStreet2
#> 1 FIRST SOLAR, INC. C/O FIRST SOLAR, INC. 350 WEST WASHINGTON STREET, SUITE 600
#>   rptOwnerCity rptOwnerState rptOwnerZipCode
#> 1        TEMPE            AZ      85281-1244

Created on 2022-08-31 by the reprex package (v2.0.1)

  • Related