Question about using rvest and purrr for scraping multiple pages with nested links

I have written the code below to extract the list of all speeches given by the FLOTUS at this link (name, title, and year for each speech). Here is the code:

library(rvest)
library(purrr)

url_base <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush?page=%d"

map_df(1:17, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(name=html_text(html_nodes(pg, ".views-field-title-1.nowrap")),
             title=html_text(html_nodes(pg, "td.views-field-title")),
             year=html_text(html_nodes(pg, ".date-display-single")),
             stringsAsFactors=FALSE)

}) -> flotus

I would like to extend this code to extract the text of each corresponding speech as well. Does anyone know how to do that with the code I've already written? If so, what would that look like?

Answer:

To get to the speeches, first retrieve the 'href' attribute of each link in the table's Title column using the html_attr() function.

library(rvest)
library(purrr)

url_base <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush?page="

flotus <- map_df(1:16, function(i) {

   # simple but effective progress indicator
   cat(".")

   pg <- read_html(paste0(url_base, i))

   # parse the table
   df <- html_node(pg, "table") %>% html_table()

   # obtain the href from the table's Title column
   df$links <- html_nodes(pg, "td.views-field-title") %>%
      html_node("a") %>%
      html_attr("href")
   df
})

The code above adds the link to each speech as an additional column, links, in the data frame.
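The link-extraction step can be tried offline before running it against the site. The following is a minimal sketch, assuming a made-up two-row table (the HTML string, the /documents/speech-one paths, and the titles are hypothetical stand-ins, not real archive content). It also passes the hrefs through xml2::url_absolute(), which resolves relative paths against a base URL and leaves already-absolute URLs unchanged; whether the real pages return relative or absolute hrefs is not confirmed here.

```r
library(rvest)

# Hypothetical stand-in for one page of the archive table
html <- '<table>
  <tr><th>Date</th><th>Title</th></tr>
  <tr><td>Jan 01, 2005</td><td class="views-field-title"><a href="/documents/speech-one">Speech One</a></td></tr>
  <tr><td>Jan 02, 2005</td><td class="views-field-title"><a href="/documents/speech-two">Speech Two</a></td></tr>
</table>'

pg <- read_html(html)

# parse the table into a data frame
df <- html_node(pg, "table") %>% html_table()

# one <a> per title cell, so the links stay aligned with the table rows
links <- html_nodes(pg, "td.views-field-title") %>%
  html_node("a") %>%
  html_attr("href")

# resolve relative paths against the site root (assumed base URL)
df$links <- xml2::url_absolute(links, "https://www.presidency.ucsb.edu/")

print(df)
```

Calling html_node("a") on each matched cell, rather than html_nodes("td.views-field-title a") on the whole page, returns NA for a cell without a link, so the links vector keeps the same length as the table.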

Part Two: To extract the text of the speeches, loop through the list of links. For each link, open the page, extract the desired information, and store it.

# limited to the first three links while debugging
speeches <- map_df(flotus$links[1:3], function(link) {
   print(link)
   # read the page
   page <- read_html(link)
   # extract the content and other info
   content <- page %>% html_node("div.field-docs-content") %>% html_text() %>% trimws()
   person <- page %>% html_node("div.field-docs-person") %>% html_text() %>% trimws()
   citation <- page %>% html_node("div.field-prez-document-citation") %>% html_text() %>% trimws()

   Sys.sleep(1) # be polite: pause between requests so the scraping does not look like an attack on the server

   # return a one-row data frame; this must be the last expression,
   # otherwise map_df() receives the NULL returned by Sys.sleep()
   data.frame(content, person, citation, stringsAsFactors = FALSE)
})

Here all of the data is stored in a data frame, which can then be joined with the flotus data frame built above, depending on what you intend to do next.
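One way to do that join is to key both data frames on the link. The following is a minimal sketch with toy data, assuming the speech data frame also carries a links column (the loop above does not store one, so that column would have to be added to its data.frame() call). The names flotus_demo and speeches_demo and the example.org URLs are hypothetical.

```r
# Toy stand-in for the listing data frame (one row per speech)
flotus_demo <- data.frame(
  Title = c("Speech One", "Speech Two"),
  links = c("https://example.org/one", "https://example.org/two"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the scraped speech texts, deliberately in a different order
speeches_demo <- data.frame(
  links   = c("https://example.org/two", "https://example.org/one"),
  content = c("Text of speech two", "Text of speech one"),
  stringsAsFactors = FALSE
)

# left join on the link: keeps every listing row even if a page failed to scrape
combined <- merge(flotus_demo, speeches_demo, by = "links", all.x = TRUE)

print(combined)
```

Joining on a key rather than binding columns by position keeps each text attached to the right title even when only a subset of links was scraped or the order differs.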