How do I scrape multiple URLs from a dataframe in R?

Time:12-21

I scraped a website already and made a dataframe from it that only contains one column. The dataframe is called "urldataframe", while the column that contains all of the urls is called "individualrace_url".

Here is some of my data. The links are currently stored as character strings in the dataframe urldataframe.

Here are the first two links: https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-arca-200 https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-construct-corps-palm-beach-grading-250

How do I create a scraper that goes through my dataframe of links one by one? I'm not sure if a for loop is the way to go about this or not. If I can use a for loop, what am I doing wrong?

res_all <- NULL

for (realtest.Event in racename) {
    urlrunning = paste0(realtest.Event)
    
  scrapinghere = read_html(urlrunning)
    
  putithere <- tibble(
      bubba = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(2)") %>% html_text(),
      bubba2 = scrapinghere %>% html_node("#sites-canvas-main-content td:nth-child(1)")) %>% html_text()
  
  
  res_all <- bind_rows(res_all, putithere)
  }

I was hoping the loop would go through each url in the dataframe. Every url has the same nodes, so I'm pretty sure my issue is in setting up the loop itself.

CodePudding user response:

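A for loop is fine. I think the problem in your case is that the closing parenthesis of tibble() is in the wrong place: it comes before the final %>% html_text(), so html_text() is applied to the whole tibble instead of to the node. Another pattern I like is purrr::map_dfr, which returns a data.frame. Here is my untested code, as no data was provided: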

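library(purrr)   # set_names(), map_dfr()
library(rvest)   # read_html(), html_nodes(), html_node(), html_text()
library(tibble)  # tibble()

# Name each URL after itself so that .id can record the source of every row
res_all <- set_names(racename) %>% 
  map_dfr(function(realtest.Event) {
    scrapinghere <- read_html(realtest.Event)
    
    tibble(
      bubba = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(2)") %>% html_text(),
      bubba2 = scrapinghere %>% html_node("#sites-canvas-main-content td:nth-child(1)") %>% html_text()
    )
  }, .id = "racename")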

I've used the .id argument to add a column to the returned data.frame holding the value of realtest.Event, so that you know which url each row of results belongs to.
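If you would rather keep the for loop, a minimal sketch of the repaired version might look like the following. Two things here are assumptions rather than part of the original code: the inline HTML strings are hypothetical stand-ins for the real pages (read_html() accepts literal HTML as well as a URL, so the loop can be tried offline), and html_nodes() is used for both columns so that they have the same length. With the real data, the vector to loop over would presumably be urldataframe$individualrace_url.

```r
library(rvest)   # read_html(), html_nodes(), html_text()
library(tibble)  # tibble()
library(dplyr)   # bind_rows()

# Hypothetical stand-ins for the real pages, so the loop runs without network access.
# With the real data this would be: pages <- urldataframe$individualrace_url
pages <- c(
  '<div id="sites-canvas-main-content"><table><tr><td>1</td><td>Driver A</td></tr><tr><td>2</td><td>Driver B</td></tr></table></div>',
  '<div id="sites-canvas-main-content"><table><tr><td>1</td><td>Driver C</td></tr></table></div>'
)

res_all <- NULL

for (page in pages) {
  scrapinghere <- read_html(page)

  putithere <- tibble(
    bubba  = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(2)") %>% html_text(),
    bubba2 = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(1)") %>% html_text()
  )  # tibble() closes only after both html_text() calls

  # bind_rows() ignores the initial NULL and appends each page's rows
  res_all <- bind_rows(res_all, putithere)
}

print(res_all)
```

Growing res_all inside the loop is fine for a few dozen pages; for many more, collecting the tibbles in a list and binding once at the end is usually cheaper.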

CodePudding user response:

Scraping the tables from the two links without an explicit loop:

library(tidyverse)
library(rvest)
library(janitor)

df <- tibble(
  links = c("https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-arca-200", 
            "https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-construct-corps-palm-beach-grading-250")
)

get_ARCA <- function(link) {
  link %>% 
    read_html() %>% 
    html_table() %>% 
    pluck(4) %>% 
    row_to_names(1) %>% 
    clean_names()
}

map_dfr(df$links, get_ARCA)

# A tibble: 81 × 9
   finish start car_number driver            sponsor                                make      laps  led   status 
   <chr>  <chr> <chr>      <chr>             <chr>                                  <chr>     <chr> <chr> <chr>  
 1 1      2     5          Bobby Gerhart     Lucas Oil                              Chevrolet 80    54    Running
 2 2      20    93         Marc Mitchell     Ergon                                  Pontiac   80    5     Running
 3 3      12    3          Jeremy Clements   Harrison's Work Wear-1 Stop Conv-Saxon Chevrolet 80    0     Running
 4 4      13    39         David Ragan       AAA                                    Ford      80    0     Running
 5 5      3     46         Frank Kimmel      Tri-State Motorsports-Pork             Ford      80    0     Running
 6 6      19    31         Timothy Peters    Cometic Gaskets-Okuma                  Chevrolet 80    0     Running
 7 7      31    16         Justin Allgaier   AG Tech-Trashman-Hoosier Tire Midwest  Chevrolet 80    0     Running
 8 8      14    4          Scott Lagasse Jr. Cunningham Motorsports                 Dodge     80    0     Running
 9 9      11    47         Phillip McGilton  SI Performance-Gould's Electric        Chevrolet 80    0     Running
10 10     17    2          Michael McDowell  Hillcrest Capital Partners             Dodge     80    0     Running
# … with 71 more rows
# ℹ Use `print(n = ...)` to see more rows
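The two janitor calls in get_ARCA are what turn the raw scraped table into tidy column names. A small sketch on a made-up data frame (the values below are hypothetical, not scraped) shows the idea: row_to_names(1) promotes the first row to the header, and clean_names() converts those headers to snake_case.

```r
library(janitor)  # row_to_names(), clean_names()

# Hypothetical raw table as html_table() might return it:
# generic column names, with the real header sitting in the first row
raw <- data.frame(
  X1 = c("Finish", "1"),
  X2 = c("Car Number", "5")
)

# Promote row 1 to column names (and drop it), then normalise the names
tidy <- clean_names(row_to_names(raw, 1))

print(names(tidy))
```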