Rvest scraping child nodes but filling missing values with NA

I am trying to scrape some data from the SEC website. Each parent node has child nodes that contain text of interest. However, in some cases a particular child node does not exist. For example, in this link:

urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"

There are 728 parent nodes. Each parent node contains several child nodes, each with a specific tag. Here is an example of one full entry (of the 728):

<infoTable>
<nameOfIssuer>APPLE INC</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>037833100</cusip>
<value>1486</value>
<shrsOrPrnAmt>
<sshPrnamt>11200</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<putCall>Put</putCall>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>11200</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>

In this example the "putCall" tag may or may not exist. When it exists, I want to get its text ("Put" in this instance). However, for this link only 8 of the 728 parent nodes have a "putCall" node. I want to fill in NA wherever there is no "putCall" node, so that I always have 728 entries for each tag and can coerce them into a data frame. This is what I have tried so far, inspired by "Inputting NA where there are missing values when scraping with rvest":

library(polite)
library(rvest)
library(purrr)
library(tidyverse)
library(httr)


session <- bow("https://www.sec.gov/")

urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"

test <- session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ list(
      name_of_issuer = html_elements(.x, xpath = "//*[local-name()='nameOfIssuer']") %>%
        html_text() %>%
        {
          if (length(.) == 0)
            NA
          else
            .
        },
      title_of_class = html_elements(.x, xpath = "//*[local-name()='titleOfClass']") %>%
        html_text() %>%
        {
          if (length(.) == 0)
            NA
          else
            .
        },
      put_or_call = html_elements(.x, xpath = "//*[local-name()='putCall']") %>%
        html_text() %>%
        {
          if (length(.) == 0)
            NA
          else
            .
        }))

This fails with the error message:

Error: Can't recycle `name_of_issuer` (size 728) to match `put_or_call` (size 8).

It seems that the NA fill is not working for the "putCall" node, and it only returns a list of 8 entries.

Any suggestions on what I am doing wrong and how to fix it?

Thanks much!

CodePudding user response：

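If I simply use httr, I can pass in a valid User-Agent header, and I can rewrite your code to build each row with data.frame() instead of list(), so that NA is returned where a value is not present.

Swap out html_elements() for html_element(). The singular form returns exactly one result per input node, and html_text() gives NA when that node is missing.

You also need to amend your xpaths to avoid getting the first node's value repeated for each row: a path that starts with // searches the whole document even when it is called on a child node, whereas .// searches only below the current node.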
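To see why both changes matter, here is a small offline sketch. The XML string is made up for illustration (it is not the SEC file), and the variable names are only placeholders:

```r
library(xml2)
library(rvest)

# Two made-up parent nodes; only the first one has a <putCall> child
doc <- read_xml(
  "<informationTable>
     <infoTable><nameOfIssuer>A</nameOfIssuer><putCall>Put</putCall></infoTable>
     <infoTable><nameOfIssuer>B</nameOfIssuer></infoTable>
   </informationTable>"
)

parents <- html_elements(doc, xpath = "//infoTable")
second  <- parents[[2]]  # the parent WITHOUT a <putCall> child

# "//" starts at the document root, so this finds the first parent's <putCall>
abs_hits <- html_text(html_elements(second, xpath = "//putCall"))

# ".//" only looks below the current node, so nothing is found here
rel_hits <- html_text(html_elements(second, xpath = ".//putCall"))

# html_element() (singular) returns a missing node, which html_text() turns into NA
rel_one <- html_text(html_element(second, xpath = ".//putCall"))
```

In this sketch the absolute path still finds a value for a parent that has none, the relative path with html_elements() returns a zero-length vector, and only the relative path with html_element() yields a single NA. The full rewrite against the SEC file follows the same pattern.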

library(tidyverse)
library(rvest) # html_elements(), html_element(), html_text(); not attached by library(tidyverse)
library(httr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))

r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
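As a possible variation, html_element() can also be applied to the whole node set at once, since it returns one result per input node; that makes the per-node map_df() loop optional. The following is only a sketch under assumptions: the XML string and its namespace (urn:example) are made up, and child_text() is a hypothetical helper, not part of rvest.

```r
library(xml2)
library(rvest)

# Made-up document with a default namespace, which is why local-name() is used
doc <- read_xml(
  "<informationTable xmlns='urn:example'>
     <infoTable><nameOfIssuer>A</nameOfIssuer><putCall>Put</putCall></infoTable>
     <infoTable><nameOfIssuer>B</nameOfIssuer></infoTable>
   </informationTable>"
)

parents <- html_elements(doc, xpath = "//*[local-name()='infoTable']")

# Hypothetical helper: text of the first matching child of each parent, NA if absent
child_text <- function(nodes, tag) {
  xpath <- sprintf(".//*[local-name()='%s']", tag)
  html_text(html_element(nodes, xpath = xpath))
}

# One column per tag, one row per parent node
holdings <- data.frame(
  name_of_issuer = child_text(parents, "nameOfIssuer"),
  put_or_call    = child_text(parents, "putCall"),
  stringsAsFactors = FALSE
)
```

Each column here has one entry per parent node, so the columns line up without any recycling; further tags could be added as extra child_text() calls.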