I have code that scrapes a website, but after a certain number of requests in a run I get a 403 Forbidden error. I understand there is an R package called polite that paces and identifies a scrape according to the host's requirements so the 403 won't occur. I tried my best at adapting it to my code but I'm stuck, and would really appreciate some help. Here is some reproducible sample code with just a few of the many links:
library(tidyverse)
library(httr)
library(rvest)
library(curl)
urls <- c(
  "https://www.pro-football-reference.com/teams/pit/2021.htm",
  "https://www.pro-football-reference.com/teams/pit/2020.htm",
  "https://www.pro-football-reference.com/teams/pit/2019.htm"
)
pitt <- map_dfr(
  .x = urls,
  .f = function(x) {
    Sys.sleep(2) # crude fixed delay between requests
    cat(1)       # progress marker
    read_html(curl(x, handle = curl::new_handle("useragent" = "chrome"))) %>%
      html_nodes("table") %>%
      html_table(header = TRUE) %>%
      simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      select(week, day, date, result = x_2, record = rec,
             opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      mutate(year = str_extract(string = x, pattern = "\\d{4}"))
  }
)
This run works fine, but the full run covers every year from 1933-2021 instead of just the three links in the example. I'm open to any way to responsibly scrape this using the polite package, or any other approach an expert might be more familiar with.
CodePudding user response:
Here is my suggestion for how to use polite in this scenario. The code creates a grid of teams and seasons and politely scrapes the data. The parser is taken from your example.
library(magrittr)
# Create polite session
host <- "https://www.pro-football-reference.com/"
session <- polite::bow(host, force = TRUE)
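# Aside: printing the session shows what polite negotiated with the host:
# the user agent it sends, the rules it found in robots.txt, and the crawl
# delay it will enforce between requests (exact values depend on the
# site's robots.txt when you run this):
session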
# Create a grid of the teams and seasons to be scraped
seasons <- 2020:2021
teams <- c("pit", "nor")
grid_to_scrape <- tidyr::expand_grid(team = teams, season = seasons)
grid_to_scrape
#> # A tibble: 4 × 2
#> team season
#> <chr> <int>
#> 1 pit 2020
#> 2 pit 2021
#> 3 nor 2020
#> 4 nor 2021
responses <- purrr::pmap_dfr(grid_to_scrape, function(team, season, session) {
  # For some verbose status updates
  cli::cli_process_start("Scrape {.val {team}}, {.val {season}}")
  # Create the full url and scrape politely, honoring the crawl delay
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  # Parse the response; the suppressed warnings come from the parser
  # (janitor), not from polite
  suppressWarnings({
    response <- scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec,
                    opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
  # Update status
  cli::cli_process_done()
  # Return the parsed data
  response
}, session = session)
#> ℹ Scrape "pit", 2020
#> ✓ Scrape "pit", 2020 ... done
#>
#> ℹ Scrape "pit", 2021
#> ✓ Scrape "pit", 2021 ... done
#>
#> ℹ Scrape "nor", 2020
#> ✓ Scrape "nor", 2020 ... done
#>
#> ℹ Scrape "nor", 2021
#> ✓ Scrape "nor", 2021 ... done
#>
responses
#> # A tibble: 77 × 10
#> week day date result record opponent team_score opponent_score year
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 1 "Mon" "Septembe… "boxs… "1-0" New Yor… "26" "16" 2020
#> 2 2 "Sun" "Septembe… "boxs… "2-0" Denver … "26" "21" 2020
#> 3 3 "Sun" "Septembe… "boxs… "3-0" Houston… "28" "21" 2020
#> 4 4 "" "" "" "" Bye Week "" "" 2020
#> 5 5 "Sun" "October … "boxs… "4-0" Philade… "38" "29" 2020
#> 6 6 "Sun" "October … "boxs… "5-0" Clevela… "38" "7" 2020
#> 7 7 "Sun" "October … "boxs… "6-0" Tenness… "27" "24" 2020
#> 8 8 "Sun" "November… "boxs… "7-0" Baltimo… "28" "24" 2020
#> 9 9 "Sun" "November… "boxs… "8-0" Dallas … "24" "19" 2020
#> 10 10 "Sun" "November… "boxs… "9-0" Cincinn… "36" "10" 2020
#> # … with 67 more rows, and 1 more variable: team <chr>
Created on 2022-02-22 by the reprex package (v2.0.1)
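If you scale this up to the full 1933-2021 run, you may also want to guard against individual seasons whose tables don't match the parser (older pages can have a different layout). Here is a minimal sketch of that, reusing the session and packages from above; scrape_season() is a hypothetical helper that just bundles the nod/scrape/parse steps, and purrr::possibly() turns an error for one season into a NULL that pmap_dfr() silently drops.
# Hypothetical helper: one polite request plus the parser from the answer
scrape_season <- function(team, season, session) {
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  suppressWarnings({
    scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec,
                    opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
}
# possibly() returns NULL for a failing season instead of aborting the run;
# pmap_dfr()/bind_rows() drop the NULLs when row-binding
safe_scrape <- purrr::possibly(scrape_season, otherwise = NULL)
full_grid <- tidyr::expand_grid(team = "pit", season = 1933:2021)
all_seasons <- purrr::pmap_dfr(full_grid, safe_scrape, session = session)
Note that polite only fetches paths the site's robots.txt permits and keeps honoring the crawl delay on every request, so the full run will take a while by design.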