Web Scraping Using Multiple Variables in Link

Time:07-19

I am trying to efficiently scrape weekly tournament data from pgatour.com and place the results in one combined table. Each link combines a stat id and a tournament id, as built in the code below.

Currently, my lapply call steps through both id vectors in parallel, so I get only 3 of the 9 possible combinations. Is there a way to change the call so that it covers every combination of the two ids?

library(plyr)   # provides rbind.fill(); load before dplyr so dplyr's verbs are not masked
library(rvest)
library(dplyr)
library(stringr)

tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")
url_g <- c(paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id,'.html', sep =""))

test_table_pga4 <- lapply(url_g, function(i){
  page2 <- read_html(i)
  test_table_pga5 <- page2 %>% html_nodes("#statsTable") %>% html_table() %>% .[[1]] %>% 
    mutate(tournament = i)    
})

test_golf7 <- as_tibble(rbind.fill(test_table_pga4))

CodePudding user response:

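paste() is vectorized, so it pairs the two id vectors position by position and yields only three links. Use expand.grid() to create every combination of stat_id and tournament_id, then mutate() a new column with the corresponding links.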
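
The difference can be checked offline before any page is requested. This is a minimal sketch in base R: the two id vectors are copied from the question, while `paired`, `grid` and `crossed` are names made up for illustration, and the pasted strings are shortened stand-ins for the real links.

```r
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# paste0() works element by element: the i-th stat is joined to the i-th tournament
paired <- paste0(stat_id, ".", tournament_id)

# expand.grid() builds one row per combination of its arguments
# (stringsAsFactors = FALSE is an illustrative choice to keep plain strings)
grid <- expand.grid(
  tournament_id = tournament_id,
  stat_id = stat_id,
  stringsAsFactors = FALSE
)
crossed <- paste0(grid$stat_id, ".", grid$tournament_id)

length(paired)   # 3
length(crossed)  # 9
```

Every element of `paired` also appears in `crossed`, so the grid adds the six missing combinations without dropping any of the original three.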

library(tidyverse)
library(magrittr)  # for the %$% exposition pipe, which tidyverse does not attach
library(janitor)
library(rvest)

df <- expand.grid(
  tournament_id = c("t041", "t054", "t464"),
  stat_id = c("02568", "02567", "02564")
) %>% 
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>% 
  as_tibble()

# Function to read one stat page and tag it with its tournament
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%                      # take the second table on the page
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(
      # character ranks give the column the same type in every table
      rank_this_week = as.character(rank_this_week),
      tournament = tournament
    ) %>%
    relocate(tournament)
}


# Retrieve the tables and bind them
df %$%
  map2_dfr(links, tournament_id, get_info) 

# A tibble: 648 × 9
   tournament rank_this_week player_name       rounds average total_sg_app
   <fct>      <chr>          <chr>              <int>   <dbl>        <dbl>
 1 t041       1              Corey Conners          4    2.89        11.6 
 2 t041       2              Matt Kuchar            4    2.16         8.62
 3 t041       3              Byeong Hun An          4    1.90         7.60
 4 t041       4              Charley Hoffman        4    1.72         6.88
 5 t041       5              Ryan Moore             4    1.43         5.73
 6 t041       6              Brian Stuard           4    1.42         5.69
 7 t041       7              Danny Lee              4    1.30         5.18
 8 t041       8              Cameron Tringale       4    1.22         4.88
 9 t041       9              Si Woo Kim             4    1.22         4.87
10 t041       10             Scottie Scheffler      4    1.16         4.62
# … with 638 more rows, and 3 more variables: measured_rounds <int>,
#   total_sg_ott <dbl>, total_sg_putting <dbl>
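
The answer does not say why get_info() converts rank_this_week to character. One plausible reason, offered here as an assumption, is that some pages parse the rank as an integer while others contain text such as tied ranks, and the rows cannot be bound when the column types differ. The sketch below uses made-up stand-in tables (`page_a` and `page_b` are hypothetical, not real PGA Tour data) to show the effect with dplyr alone.

```r
library(dplyr)

# Made-up stand-ins for two scraped tables (not real data)
page_a <- tibble(rank_this_week = 1:2,
                 player_name = c("Player A", "Player B"))
page_b <- tibble(rank_this_week = c("T1", "T1"),
                 player_name = c("Player C", "Player D"))

# Binding directly fails because integer and character columns cannot be combined
direct <- tryCatch(
  bind_rows(page_a, page_b),
  error = function(e) "type mismatch"
)

# Converting the rank first, as get_info() does, lets the rows stack
fixed <- bind_rows(
  mutate(page_a, rank_this_week = as.character(rank_this_week)),
  page_b
)
```

Doing the conversion inside the function that reads each page means every table already has a consistent column type by the time the tables are bound.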