I am working with a data frame (call it full_df) that contains links that I want to use to scrape two further links. Here is a sample of the data frame:
structure(list(CIK = c("1082339", "1276755", "1280511"), COMPANY_NAME = c("COLDSTREAM CAPITAL MANAGEMENT INC",
"CHELSEA COUNSEL CO", "QUANTUM CAPITAL MANAGEMENT"), FORM_TYPE = c("13F-HR",
"13F-HR", "13F-HR"), FILE_DATE = c("2020-05-27", "2020-06-12",
"2020-05-26"), FORM_LINK = c("edgar/data/1082339/0001082339-20-000002.txt",
"edgar/data/1276755/0001420506-20-000683.txt", "edgar/data/1280511/0001280511-20-000003.txt"
), QTR_YEAR = c("Q22020", "Q22020", "Q22020"), FULL_LINK = c("https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm",
"https://www.sec.gov/Archives/edgar/data/1276755/0001420506-20-000683-index.htm",
"https://www.sec.gov/Archives/edgar/data/1280511/0001280511-20-000003-index.htm"
)), row.names = c(NA, 3L), class = "data.frame")
I would like to iterate over the FULL_LINK column and obtain two further links that I would then want to add to my original data frame as two new columns - xml_link and html_link.
I can get the links using a function that I have written, like so (with a single link used as an example here):
library(polite)
library(rvest)
library(glue)
library(tidyverse)
test_link <- "https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm"
ua <- 'Kartik P (for personal use)'
session <- bow("https://www.sec.gov/",
               user_agent = ua)
xml_scraper <- function(urll) {
  print(glue("Scraping: {urll}"))
  # grab the href of every anchor tag on the filing index page
  temp_link <- session %>%
    nod(urll) %>%
    scrape(verbose = FALSE) %>%
    html_nodes("a") %>%
    html_attr('href')
  # the two filing links sit at fixed positions in that list of hrefs
  xml_link <- temp_link %>%
    nth(12)
  html_link <- temp_link %>%
    nth(11)
  return(data.frame(xml_link, html_link))
}
Great! This works as expected and returns a data frame with the two columns that I want:
xml_scraper(test_link)
Scraping: https://www.sec.gov/Archives/edgar/data/1082339/0001082339-20-000002-index.htm
xml_link
1 /Archives/edgar/data/1082339/000108233920000002/CCMI13F2020Q1.xml
html_link
1 /Archives/edgar/data/1082339/000108233920000002/xslForm13F_X01/CCMI13F2020Q1.xml
However, what I would like to do is iterate over each element of the FULL_LINK column in full_df and add the two new links as elements of newly created xml_link and html_link columns in the original data frame. It feels like this should be doable with purrr::map_dfr and a bind_cols call, or by mutating two named variables simultaneously, but I am unable to figure out the syntax.
Would appreciate any suggestions on how to get this to work with dplyr and purrr.
Thanks in advance.
CodePudding user response:
Maybe:
df_new <- bind_cols(map_dfr(df$FULL_LINK, xml_scraper), df)
Result:
#> # A tibble: 3 × 9
#> xml_link html_link CIK COMPANY_NAME FORM_TYPE FILE_DATE FORM_LINK QTR_YEAR
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 /Archive… /Archives… 1082… COLDSTREAM … 13F-HR 2020-05-… edgar/da… Q22020
#> 2 /Archive… /Archives… 1276… CHELSEA COU… 13F-HR 2020-06-… edgar/da… Q22020
#> 3 /Archive… /Archives… 1280… QUANTUM CAP… 13F-HR 2020-05-… edgar/da… Q22020
#> # … with 1 more variable: FULL_LINK <chr>
Created on 2022-01-01 by the reprex package (v2.0.1)
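If you prefer the scraped links to come after the original columns rather than before them, the same idea works with the arguments to bind_cols swapped (a minimal sketch, assuming xml_scraper returns exactly one row per link so the row order still matches df):
df_new <- bind_cols(df, map_dfr(df$FULL_LINK, xml_scraper))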
CodePudding user response:
You can just mutate the data set using your xml_scraper function. You need to do the mutate "rowwise", since your function isn't vectorized.
data_full <- data %>%
  rowwise() %>%
  mutate(xml_link = xml_scraper(FULL_LINK) %>% pluck("xml_link"),
         html_link = xml_scraper(FULL_LINK) %>% pluck("html_link"))
# If you want just the results of the scrape, you can use map
the_xml <- data %>%
  split(1:nrow(.)) %>%
  map(~ pluck(.x, "FULL_LINK")) %>%
  map(xml_scraper) %>%
  bind_rows()
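Note that the rowwise mutate above calls xml_scraper twice per row, so every index page gets scraped twice. A minimal sketch of a single-scrape variant, assuming xml_scraper always returns a one-row data frame and that tidyr is available (it loads with the tidyverse): store each result in a list column and spread it with unnest_wider.
data_full <- data %>%
  rowwise() %>%
  mutate(scraped = list(xml_scraper(FULL_LINK))) %>%  # one scrape per row
  ungroup() %>%
  unnest_wider(scraped)  # spreads into xml_link and html_link columns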
CodePudding user response:
You can edit your function to also output FULL_LINK and use it to join the two new columns to your original data:
xml_scraper <- function(urll) {
  print(glue("Scraping: {urll}"))
  temp_link <- session %>%
    nod(urll) %>%
    scrape(verbose = FALSE) %>%
    html_nodes("a") %>%
    html_attr('href')
  xml_link <- temp_link %>%
    nth(12)
  html_link <- temp_link %>%
    nth(11)
  # include the input URL so the result can be joined back by FULL_LINK
  return(data.frame(FULL_LINK = urll, xml_link, html_link))
}
Then
data2 <- map_dfr(data$FULL_LINK, .f = xml_scraper) %>%
  left_join(data, ., by = "FULL_LINK")
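Because each call hits the SEC site, you may also want to guard the loop so that one failing page does not abort the whole map_dfr run. A minimal sketch using purrr::possibly(), assuming you are happy for rows whose scrape fails to end up with NA in xml_link and html_link after the join:
safe_scraper <- possibly(xml_scraper, otherwise = NULL)  # failed scrapes contribute no rows

data2 <- map_dfr(data$FULL_LINK, .f = safe_scraper) %>%
  left_join(data, ., by = "FULL_LINK")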