Home > Net >  How to Bypass Empty Tables in Web-scrape in r
How to Bypass Empty Tables in Web-scrape in r

Time:11-12

I am scraping multiple webpages with the goal of getting the data from each webpage into one encompassing dataframe.

Problem: The r-script works for the most part, but as I expand the amount of webpages through expand.grid I get the error prompt below. I believe the error occurs when the scrape runs into an empty table along the way.

Error: Can't combine ..1$tables$player_name and ..146$tables$player_name .

Going Forward: When I remove the empty tables the code works fine. Is there an efficient way to solve this issue?

#scrape select stats from tournaments
df <- expand.grid(
  tournament_id = c("t060", "t475", "t045", "t021"),
  stat_id = c("02564", "101", "102", "336", "340"),
  year_id = c("2004", "2005", "2006")
) %>% 
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y',
      year_id,
      '.eon.',
      tournament_id,
      '.html'
    )
  ) %>% 
  as_tibble()

# Function to get the data combined into a table
get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>% 
    select(-rank_last_week) %>% 
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d ")))
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  data
}

test101 <- df %>%
  mutate(tables = map2(links, tournament_id, get_info)) %>%
  tidyr::unnest(everything())

CodePudding user response:

This can be solved with purrr::possibly() Minor adjustment to your code, player_name is always converted to character.

get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    janitor::clean_names() %>% 
    select(-rank_last_week) %>% 
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d ")), 
           player_name = as.character(player_name))
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  return(data)
}

Wrap get_info with possibly to return empty tibble if no table is available.

test101 <- df %>%
  mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))

# A tibble: 60 x 5
   tournament_id stat_id year_id links                                                        tables           
   <fct>         <fct>   <fct>   <chr>                                                        <list>           
 1 t060          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t060.html <tibble [29 x 6]>
 2 t475          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t475.html <tibble [72 x 6]>
 3 t045          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t045.html <tibble [86 x 6]>
 4 t021          02564   2004    https://www.pgatour.com/stats/stat.02564.y2004.eon.t021.html <tibble [0 x 2]> 
 5 t060          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t060.html   <tibble [29 x 6]>
 6 t475          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t475.html   <tibble [72 x 6]>
 7 t045          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t045.html   <tibble [86 x 6]>
 8 t021          101     2004    https://www.pgatour.com/stats/stat.101.y2004.eon.t021.html   <tibble [0 x 2]> 
 9 t060          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t060.html   <tibble [29 x 6]>
10 t475          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t475.html   <tibble [72 x 6]>
11 t045          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t045.html   <tibble [86 x 6]>
12 t021          102     2004    https://www.pgatour.com/stats/stat.102.y2004.eon.t021.html   <tibble [0 x 2]> 
13 t060          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t060.html   <tibble [29 x 7]>
14 t475          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t475.html   <tibble [72 x 7]>
15 t045          336     2004    https://www.pgatour.com/stats/stat.336.y2004.eon.t045.html   <tibble [86 x 7]>
# ... with 45 more rows
# i Use `print(n = ...)` to see more rows

The column can then be unnest()

test101 %>% 
  unnest(everything())
# A tibble: 3,399 x 19
   tournament_id stat_id year_id links  rank_~1 playe~2 rounds average total~3 measu~4 avg   total~5 total~6 percent fairw~7 possi~8 total~9 numbe~*
   <fct>         <fct>   <fct>   <chr>    <int> <chr>    <int>   <dbl>   <dbl>   <int> <chr> <chr>     <int>   <dbl>   <int>   <int> <chr>     <int>
 1 t060          02564   2004    https~       1 Stephe~      4   1.77     7.09       4 NA    NA           NA      NA      NA      NA NA           NA
 2 t060          02564   2004    https~       2 Mike W~      4   1.56     6.22       4 NA    NA           NA      NA      NA      NA NA           NA
 3 t060          02564   2004    https~       3 Tiger ~      4   1.20     4.79       4 NA    NA           NA      NA      NA      NA NA           NA
 4 t060          02564   2004    https~       4 John D~      4   0.93     3.72       4 NA    NA           NA      NA      NA      NA NA           NA
 5 t060          02564   2004    https~       5 Mark H~      4   0.918    3.67       4 NA    NA           NA      NA      NA      NA NA           NA
 6 t060          02564   2004    https~       6 Zach J~      4   0.893    3.57       4 NA    NA           NA      NA      NA      NA NA           NA
 7 t060          02564   2004    https~       7 Scott ~      4   0.695    2.78       4 NA    NA           NA      NA      NA      NA NA           NA
 8 t060          02564   2004    https~       8 Jerry ~      4   0.661    2.64       4 NA    NA           NA      NA      NA      NA NA           NA
 9 t060          02564   2004    https~       9 Stuart~      4   0.624    2.50       4 NA    NA           NA      NA      NA      NA NA           NA
10 t060          02564   2004    https~      10 Retief~      4   0.612    2.45       4 NA    NA           NA      NA      NA      NA NA           NA
# ... with 3,389 more rows, 1 more variable: relative_to_par <chr>, and abbreviated variable names 1: rank_this_week, 2: player_name,
#   3: total_sg_putting, 4: measured_rounds, 5: total_distance, 6: total_drives, 7: fairways_hit, 8: possible_fairways, 9: total_distance_feet,
#   *: number_of_attempts
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
  • Related