I am trying to efficiently scrape weekly tournament data from pgatour.com and place the results in one encompassing table. Below is an example link that I will use:
Currently, my lapply call cycles through both IDs in parallel (first with first, second with second, and so on), so I only get 3 of the 9 possible combinations. Is there a way to change my lapply call so that it cycles through every combination of the two IDs?
library(rvest)
library(dplyr)
library(stringr)

tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# paste() is vectorised element-wise, so this builds only 3 URLs (one per pair)
url_g <- paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id, '.html', sep = "")

test_table_pga4 <- lapply(url_g, function(i) {
  page2 <- read_html(i)
  page2 %>%
    html_nodes("#statsTable") %>%
    html_table() %>%
    .[[1]] %>%
    mutate(tournament = i)
})

# rbind.fill() lives in plyr, which is not attached above, so call it by namespace
test_golf7 <- as_tibble(plyr::rbind.fill(test_table_pga4))
CodePudding user response:
Use expand.grid() to create all unique combinations of stat_id and tournament_id, and then mutate a new column with those links.
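library(tidyverse)
library(magrittr)  # for %$%, which library(tidyverse) does not attach
library(janitor)
library(rvest)

# Every tournament/stat combination (3 x 3 = 9 rows), plus the URL for each
df <- expand.grid(
  tournament_id = c("t041", "t054", "t464"),
  stat_id = c("02568", "02567", "02564")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>%
  as_tibble()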
# Function to get the table
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(
      rank_this_week = as.character(rank_this_week),
      tournament = tournament
    ) %>%
    relocate(tournament)
}

# Retrieve the tables and bind them
df %$%
  map2_dfr(links, tournament_id, get_info)
# A tibble: 648 × 9
   tournament rank_this_week player_name       rounds average total_sg_app
   <fct>      <chr>          <chr>              <int>   <dbl>        <dbl>
 1 t041       1              Corey Conners          4    2.89         11.6
 2 t041       2              Matt Kuchar            4    2.16         8.62
 3 t041       3              Byeong Hun An          4    1.90         7.60
 4 t041       4              Charley Hoffman        4    1.72         6.88
 5 t041       5              Ryan Moore             4    1.43         5.73
 6 t041       6              Brian Stuard           4    1.42         5.69
 7 t041       7              Danny Lee              4    1.30         5.18
 8 t041       8              Cameron Tringale       4    1.22         4.88
 9 t041       9              Si Woo Kim             4    1.22         4.87
10 t041       10             Scottie Scheffler      4    1.16         4.62
# … with 638 more rows, and 3 more variables: measured_rounds <int>,
#   total_sg_ott <dbl>, total_sg_putting <dbl>
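
To see why the original approach only produced 3 links, it may help to compare the two ways of building the URLs without touching the network. This is a minimal base R sketch using the same IDs as above; the names paired and combos are only illustrative, and stringsAsFactors = FALSE is an optional choice here to keep the IDs as character rather than factor.

```r
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# paste0() pairs its arguments element by element:
# 1st stat with 1st tournament, 2nd with 2nd, 3rd with 3rd
paired <- paste0(
  "https://www.pgatour.com/stats/stat.", stat_id,
  ".y2019.eon.", tournament_id, ".html"
)
length(paired)
# 3

# expand.grid() builds one row per combination of its arguments
combos <- expand.grid(
  tournament_id = tournament_id,
  stat_id = stat_id,
  stringsAsFactors = FALSE  # keep the IDs as plain character vectors
)
combos$links <- paste0(
  "https://www.pgatour.com/stats/stat.", combos$stat_id,
  ".y2019.eon.", combos$tournament_id, ".html"
)
nrow(combos)
# 9
```

The combos$links vector could then be passed to the original lapply call in place of url_g, so the scraping function itself would not need to change.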