I have code that scrapes a website, but after a certain number of requests in a run I get a 403 Forbidden error. I understand there is an R package called polite that paces and identifies a scrape according to the host's requirements so the 403 won't occur. I tried my best at adapting it to my code but I'm stuck, and would really appreciate some help. Here is some reproducible sample code with just a few of the many links:
library(tidyverse)
library(httr)
library(rvest)
library(curl)
urls <- c(
  "https://www.pro-football-reference.com/teams/pit/2021.htm",
  "https://www.pro-football-reference.com/teams/pit/2020.htm",
  "https://www.pro-football-reference.com/teams/pit/2019.htm"
)
pitt <- map_dfr(
  .x = urls,
  .f = function(x) {
    Sys.sleep(2) # crude fixed delay between requests
    cat(1)       # progress marker
    read_html(curl(x, handle = curl::new_handle("useragent" = "chrome"))) %>%
      html_nodes("table") %>%
      html_table(header = TRUE) %>%
      simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      select(week, day, date, result = x_2, record = rec,
             opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      mutate(year = str_extract(string = x, pattern = "\\d{4}"))
  }
)
This run works fine, but the full run covers every year from 1933-2021 instead of just the three links in the example. I'm open to any way to responsibly scrape this using the polite package, or any other approach an expert might be more familiar with.
CodePudding user response:
Here is my suggestion for how to use polite in this scenario. The code creates a grid of teams and seasons and politely scrapes the data. The parser is taken from your example.
library(magrittr)
# Create polite session
host <- "https://www.pro-football-reference.com/"
session <- polite::bow(host, force = TRUE)
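# Aside: printing the session shows what polite negotiated with the host:
# the user agent it sends, the rules it found in robots.txt, and the crawl
# delay it will enforce between requests (exact values depend on the
# site's robots.txt when you run this):
session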
# Create a grid of the teams and seasons to be scraped
seasons <- 2020:2021
teams <- c("pit", "nor")
grid_to_scrape <- tidyr::expand_grid(team = teams, season = seasons)
grid_to_scrape
#> # A tibble: 4 × 2
#> team season
#> <chr> <int>
#> 1 pit 2020
#> 2 pit 2021
#> 3 nor 2020
#> 4 nor 2021
responses <- purrr::pmap_dfr(grid_to_scrape, function(team, season, session) {
  # For some verbose status updates
  cli::cli_process_start("Scrape {.val {team}}, {.val {season}}")
  # Create the full url and scrape politely, honoring the crawl delay
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  # Parse the response; the suppressed warnings come from the parser
  # (janitor), not from polite
  suppressWarnings({
    response <- scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec,
                    opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
  # Update status
  cli::cli_process_done()
  # Return the parsed data
  response
}, session = session)
#> ℹ Scrape "pit", 2020
#> ✓ Scrape "pit", 2020 ... done
#>
#> ℹ Scrape "pit", 2021
#> ✓ Scrape "pit", 2021 ... done
#>
#> ℹ Scrape "nor", 2020
#> ✓ Scrape "nor", 2020 ... done
#>
#> ℹ Scrape "nor", 2021
#> ✓ Scrape "nor", 2021 ... done
#>
responses
#> # A tibble: 77 × 10
#> week day date result record opponent team_score opponent_score year
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 1 "Mon" "Septembe… "boxs… "1-0" New Yor… "26" "16" 2020
#> 2 2 "Sun" "Septembe… "boxs… "2-0" Denver … "26" "21" 2020
#> 3 3 "Sun" "Septembe… "boxs… "3-0" Houston… "28" "21" 2020
#> 4 4 "" "" "" "" Bye Week "" "" 2020
#> 5 5 "Sun" "October … "boxs… "4-0" Philade… "38" "29" 2020
#> 6 6 "Sun" "October … "boxs… "5-0" Clevela… "38" "7" 2020
#> 7 7 "Sun" "October … "boxs… "6-0" Tenness… "27" "24" 2020
#> 8 8 "Sun" "November… "boxs… "7-0" Baltimo… "28" "24" 2020
#> 9 9 "Sun" "November… "boxs… "8-0" Dallas … "24" "19" 2020
#> 10 10 "Sun" "November… "boxs… "9-0" Cincinn… "36" "10" 2020
#> # … with 67 more rows, and 1 more variable: team <chr>
Created on 2022-02-22 by the reprex package (v2.0.1)
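If you scale this up to the full 1933-2021 run, you may also want to guard against individual seasons whose tables don't match the parser (older pages can have a different layout). Here is a minimal sketch of that, reusing the session and packages from above; scrape_season() is a hypothetical helper that just bundles the nod/scrape/parse steps, and purrr::possibly() turns an error for one season into a NULL that pmap_dfr() silently drops.
# Hypothetical helper: one polite request plus the parser from the answer
scrape_season <- function(team, season, session) {
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  suppressWarnings({
    scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec,
                    opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
}
# possibly() returns NULL for a failing season instead of aborting the run;
# pmap_dfr()/bind_rows() drop the NULLs when row-binding
safe_scrape <- purrr::possibly(scrape_season, otherwise = NULL)
full_grid <- tidyr::expand_grid(team = "pit", season = 1933:2021)
all_seasons <- purrr::pmap_dfr(full_grid, safe_scrape, session = session)
Note that polite only fetches paths the site's robots.txt permits and keeps honoring the crawl delay on every request, so the full run will take a while by design.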