scrape a data table from USA Today database

Time:11-13

I tried to scrape the data table here: https://databases.usatoday.com/mlb-salaries-2022/.

Below is the code I have. I read the page source and found no "table" element in it, so apparently the approach suggested in several tutorials for scraping tables from a website won't work here. One possible reason is that USA Today does not allow this table to be scraped, but I have no clue. I only need to know how to scrape the table on the first page; then I should be able to get the tables from all pages.

I appreciate any suggestions or help. Thanks!

library(rvest)

page <- read_html("https://databases.usatoday.com/mlb-salaries-2022/")
page %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()

The result is "Error in .[[1]] : subscript out of bounds", since no "table" element appears in the source code of the web page.
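The error can be reproduced offline, which may help confirm the cause. The sketch below is only an illustration: `static_html` is a made-up stand-in for a page whose table is filled in later by JavaScript, not the actual USA Today markup. It shows that selecting "table" on such a page gives an empty node set, so `.[[1]]` has nothing to index.

```r
library(rvest)

# Hypothetical stand-in for a page whose static HTML holds only an
# empty container (the real page's markup is not reproduced here).
static_html <- '<html><body><div id="table-container"></div></body></html>'
page <- read_html(static_html)

# Select every <table> element; here there are none.
tables <- html_elements(page, "table")
n_tables <- length(tables)

# Guard against indexing an empty node set instead of calling .[[1]] blindly.
first_table <- if (n_tables > 0) html_table(tables[[1]]) else NULL
```

Checking `length()` before indexing turns the "subscript out of bounds" error into an explicit "no table found" case.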

CodePudding user response:

As Dave2e suggested, rvest alone will not work on this site: the table is not part of the page's HTML but is fetched separately from admin-ajax.php. Request that endpoint directly with httr2:

library(tidyverse)
library(httr2)

# Fetch one page of the salary table from the site's AJAX endpoint
get_baseball <- function(page) {
  cat("Getting Major League page", page, "\n")
  "https://databases.usatoday.com/wp-admin/admin-ajax.php" %>%
    request() %>%
    # Form fields copied from the request the page itself sends;
    # 'security' looks like a session token and may need refreshing
    req_body_form(
      'action' = "cspFetchTable",
      'security' = "f78301d4bd",
      'pageID' = 330,
      'sortBy' = "Salary",
      'sortOrder' = "desc",
      'page' = page,
      'searches' = "{}",
      'heads' = "true"
    ) %>%
    req_perform() %>%
    resp_body_json(check_type = FALSE) %>%
    getElement("data") %>%
    getElement("Result") %>%
    # Keep the first 8 fields of each row, turning NULL fields into NA
    map(~ modify_if(.x[1:8], is.null, ~ NA)) %>%
    bind_rows()
}

df <- map_dfr(1:49, get_baseball)

# A tibble: 971 x 8
   PK_ID Player             Team         Position   Salary Years        Total_value Average_Annual
   <int> <chr>              <chr>        <chr>       <int> <chr>              <int>          <int>
 1   550 Scherzer, Max      N.Y. Mets    RHP      43333333 3 (2022-24)    130000000       43333333
 2   391 Trout, Mike        L.A. Angels  OF       37116667 12 (2019-30)   426500000       35541667
 3   392 Rendon, Anthony    L.A. Angels  3B       36571429 7 (2020-26)    245000000       35000000
 4   582 Cole, Gerrit       N.Y. Yankees RHP      36000000 9 (2020-28)    324000000       36000000
 5   519 Correa, Carlos     Minnesota    SS       35100000 3 (2022-24)    105300000       35100000
 6   714 Machado, Manny     San Diego    3B       34000000 10 (2019-28)   300000000       30000000
 7   876 Seager, Corey      Texas        SS       33000000 10 (2022-31)   325000000       32500000
 8   812 Arenado, Nolan     St. Louis    3B       32974847 8 (2019-26)    260000000       32500000
 9   551 Lindor, Francisco  N.Y. Mets    SS       32477277 10 (2022-31)   341000000       34100000
10   938 Strasburg, Stephen Washington   RHP      32205854 7 (2020-26)    245000000       35000000
# ... with 961 more rows
# i Use `print(n = ...)` to see more rows
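The row-cleaning step in get_baseball() is meant to turn missing (NULL) JSON fields into NA before the rows are bound together. Here is a minimal offline sketch of that idea. The records in `rows` are made up for illustration, and `purrr::modify_if()` is just one way to do the replacement.

```r
library(purrr)
library(dplyr)

# Hypothetical records shaped like parsed JSON rows;
# the second one has a NULL (missing) salary.
rows <- list(
  list(Player = "A", Team = "X", Salary = 100L),
  list(Player = "B", Team = "Y", Salary = NULL)
)

clean <- rows %>%
  # Replace each NULL field with NA so every row has the same fields
  map(~ modify_if(.x, is.null, ~ NA)) %>%
  # Stack the cleaned rows into one tibble
  bind_rows()
```

After this, `clean` has one row per record and an NA wherever the source field was NULL.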