I tried to scrape the data table here: https://databases.usatoday.com/mlb-salaries-2022/.
Below is the code I used. I read the page source and found that "table" is not among the tags or classes it uses. Apparently, the method that several tutorials suggest for scraping data tables from a website won't work here. One possible reason is that the data table on USA Today is not allowed to be scraped, but I'm not sure. I only need to know how to scrape the table on the first page; from there I should be able to get the tables from all pages.
I appreciate any suggestions or help. Thanks!
library(rvest)
page <- read_html("https://databases.usatoday.com/mlb-salaries-2022/")
page %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()
The result is "Error in .[[1]] : subscript out of bounds", because "table" is not used in the source code of the web page, so html_nodes("table") returns nothing to index into.
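A quick way to see where the error comes from is to count the matched nodes before indexing. The sketch below runs offline on a stand-in page built with rvest's minimal_html(); the HTML string and the id "salary-table" are made up for illustration and are not taken from the USA Today page.

```r
library(rvest)

# Hypothetical stand-in for a page whose table is filled in after load
page <- minimal_html("<div id='salary-table'></div>")

# Select every <table> element in the static HTML
tables <- html_elements(page, "table")
n_tables <- length(tables)

if (n_tables == 0) {
  # Indexing with [[1]] here would raise "subscript out of bounds"
  message("No <table> elements found in the static HTML.")
} else {
  html_table(tables[[1]])
}
```

If the count is zero on the real page as well, the table is most likely delivered by a separate request rather than in the HTML that read_html() receives.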
CodePudding user response:
As suggested by Dave2e, you cannot scrape this site with rvest. Make a request with httr2 instead, like this:
library(tidyverse)
library(httr2)

# Fetch one page of the salary table from the site's AJAX endpoint
get_baseball <- function(page) {
  cat("Getting Major League page", page, "\n")
  "https://databases.usatoday.com/wp-admin/admin-ajax.php" %>%
    request() %>%
    # Form fields sent in the POST body
    req_body_form(
      action = "cspFetchTable",
      # Value copied as-is; assumed to be site-issued, so it may need refreshing
      security = "f78301d4bd",
      pageID = 330,
      sortBy = "Salary",
      sortOrder = "desc",
      page = page,
      searches = "{}",
      heads = "true"
    ) %>%
    req_perform() %>%
    resp_body_json(check_type = FALSE) %>%
    getElement("data") %>%
    getElement("Result") %>%
    # Keep the first 8 fields of each record and turn NULLs into NA
    map(~ modify_if(.x[1:8], is.null, ~ NA)) %>%
    bind_rows()
}

# Fetch pages 1 to 49 and row-bind the results
df <- map_dfr(1:49, get_baseball)
# A tibble: 971 x 8
PK_ID Player Team Position Salary Years Total_value Average_Annual
<int> <chr> <chr> <chr> <int> <chr> <int> <int>
1 550 Scherzer, Max N.Y. Mets RHP 43333333 3 (2022-24) 130000000 43333333
2 391 Trout, Mike L.A. Angels OF 37116667 12 (2019-30) 426500000 35541667
3 392 Rendon, Anthony L.A. Angels 3B 36571429 7 (2020-26) 245000000 35000000
4 582 Cole, Gerrit N.Y. Yankees RHP 36000000 9 (2020-28) 324000000 36000000
5 519 Correa, Carlos Minnesota SS 35100000 3 (2022-24) 105300000 35100000
6 714 Machado, Manny San Diego 3B 34000000 10 (2019-28) 300000000 30000000
7 876 Seager, Corey Texas SS 33000000 10 (2022-31) 325000000 32500000
8 812 Arenado, Nolan St. Louis 3B 32974847 8 (2019-26) 260000000 32500000
9 551 Lindor, Francisco N.Y. Mets SS 32477277 10 (2022-31) 341000000 34100000
10 938 Strasburg, Stephen Washington RHP 32205854 7 (2020-26) 245000000 35000000
# ... with 961 more rows
# i Use `print(n = ...)` to see more rows
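The step that turns each JSON record into a tibble row can be tried offline. This is only a minimal sketch: the records below are made up to mimic the shape of the parsed response (a list of named lists in which a field may be NULL) and are not real data from the site.

```r
library(purrr)
library(dplyr)

# Made-up records; names and values are illustrative only
records <- list(
  list(Player = "Player A", Team = "Team X", Salary = 100L),
  list(Player = "Player B", Team = NULL, Salary = 200L)
)

rows <- records %>%
  # Replace NULL fields with NA so every record keeps all of its columns
  map(~ modify_if(.x, is.null, ~ NA)) %>%
  # Stack the records into one tibble, one row per record
  bind_rows()

rows
```

Replacing NULL with NA before binding keeps the missing field visible as an explicit NA in the resulting row.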