I am scraping multiple webpages with the goal of getting the data from each webpage into one encompassing dataframe.
Problem: The r-script works for the most part, but as I expand the amount of webpages through expand.grid I get the error prompt below. I believe the error occurs when the scrape runs into an empty table along the way.
Error: Can't combine ..1$tables$player_name
and ..146$tables$player_name
.
Going Forward: When I remove the empty tables the code works fine. Is there an efficient way to solve this issue?
#scrape select stats from tournaments
df <- expand.grid(
tournament_id = c("t060", "t475", "t045", "t021"),
stat_id = c("02564", "101", "102", "336", "340"),
year_id = c("2004", "2005", "2006")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/stat.',
stat_id,
'.y',
year_id,
'.eon.',
tournament_id,
'.html'
)
) %>%
as_tibble()
# Function to get the data combined into a table
get_info <- function(link, tournament_id) {
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
clean_names() %>%
select(-rank_last_week) %>%
mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d ")))
try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
data
}
test101 <- df %>%
mutate(tables = map2(links, tournament_id, get_info)) %>%
tidyr::unnest(everything())
CodePudding user response:
This can be solved with purrr::possibly()
Minor adjustment to your code, player_name
is always converted to character.
get_info <- function(link, tournament_id) {
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
janitor::clean_names() %>%
select(-rank_last_week) %>%
mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d ")),
player_name = as.character(player_name))
try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
try(data <- mutate(data, gir_rank = as.character(gir_rank)), silent = TRUE)
try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
return(data)
}
Wrap get_info
with possibly
to return empty tibble if no table is available.
test101 <- df %>%
mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))
# A tibble: 60 x 5
tournament_id stat_id year_id links tables
<fct> <fct> <fct> <chr> <list>
1 t060 02564 2004 https://www.pgatour.com/stats/stat.02564.y2004.eon.t060.html <tibble [29 x 6]>
2 t475 02564 2004 https://www.pgatour.com/stats/stat.02564.y2004.eon.t475.html <tibble [72 x 6]>
3 t045 02564 2004 https://www.pgatour.com/stats/stat.02564.y2004.eon.t045.html <tibble [86 x 6]>
4 t021 02564 2004 https://www.pgatour.com/stats/stat.02564.y2004.eon.t021.html <tibble [0 x 2]>
5 t060 101 2004 https://www.pgatour.com/stats/stat.101.y2004.eon.t060.html <tibble [29 x 6]>
6 t475 101 2004 https://www.pgatour.com/stats/stat.101.y2004.eon.t475.html <tibble [72 x 6]>
7 t045 101 2004 https://www.pgatour.com/stats/stat.101.y2004.eon.t045.html <tibble [86 x 6]>
8 t021 101 2004 https://www.pgatour.com/stats/stat.101.y2004.eon.t021.html <tibble [0 x 2]>
9 t060 102 2004 https://www.pgatour.com/stats/stat.102.y2004.eon.t060.html <tibble [29 x 6]>
10 t475 102 2004 https://www.pgatour.com/stats/stat.102.y2004.eon.t475.html <tibble [72 x 6]>
11 t045 102 2004 https://www.pgatour.com/stats/stat.102.y2004.eon.t045.html <tibble [86 x 6]>
12 t021 102 2004 https://www.pgatour.com/stats/stat.102.y2004.eon.t021.html <tibble [0 x 2]>
13 t060 336 2004 https://www.pgatour.com/stats/stat.336.y2004.eon.t060.html <tibble [29 x 7]>
14 t475 336 2004 https://www.pgatour.com/stats/stat.336.y2004.eon.t475.html <tibble [72 x 7]>
15 t045 336 2004 https://www.pgatour.com/stats/stat.336.y2004.eon.t045.html <tibble [86 x 7]>
# ... with 45 more rows
# i Use `print(n = ...)` to see more rows
The column can then be unnest()
test101 %>%
unnest(everything())
# A tibble: 3,399 x 19
tournament_id stat_id year_id links rank_~1 playe~2 rounds average total~3 measu~4 avg total~5 total~6 percent fairw~7 possi~8 total~9 numbe~*
<fct> <fct> <fct> <chr> <int> <chr> <int> <dbl> <dbl> <int> <chr> <chr> <int> <dbl> <int> <int> <chr> <int>
1 t060 02564 2004 https~ 1 Stephe~ 4 1.77 7.09 4 NA NA NA NA NA NA NA NA
2 t060 02564 2004 https~ 2 Mike W~ 4 1.56 6.22 4 NA NA NA NA NA NA NA NA
3 t060 02564 2004 https~ 3 Tiger ~ 4 1.20 4.79 4 NA NA NA NA NA NA NA NA
4 t060 02564 2004 https~ 4 John D~ 4 0.93 3.72 4 NA NA NA NA NA NA NA NA
5 t060 02564 2004 https~ 5 Mark H~ 4 0.918 3.67 4 NA NA NA NA NA NA NA NA
6 t060 02564 2004 https~ 6 Zach J~ 4 0.893 3.57 4 NA NA NA NA NA NA NA NA
7 t060 02564 2004 https~ 7 Scott ~ 4 0.695 2.78 4 NA NA NA NA NA NA NA NA
8 t060 02564 2004 https~ 8 Jerry ~ 4 0.661 2.64 4 NA NA NA NA NA NA NA NA
9 t060 02564 2004 https~ 9 Stuart~ 4 0.624 2.50 4 NA NA NA NA NA NA NA NA
10 t060 02564 2004 https~ 10 Retief~ 4 0.612 2.45 4 NA NA NA NA NA NA NA NA
# ... with 3,389 more rows, 1 more variable: relative_to_par <chr>, and abbreviated variable names 1: rank_this_week, 2: player_name,
# 3: total_sg_putting, 4: measured_rounds, 5: total_distance, 6: total_drives, 7: fairways_hit, 8: possible_fairways, 9: total_distance_feet,
# *: number_of_attempts
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names