How to write an R script to extract URLs from an HTML table

I'm trying to extract every URL like "https://....zip" from the <a href=""> elements of the page https://divvy-tripdata.s3.amazonaws.com/index.html using the rvest library, as follows:

link <- "https://divvy-tripdata.s3.amazonaws.com/index.html"

library(rvest)
library(xml2)

html <- read_html(link)

html %>% html_attrs("href")

Output:

html %>% html_attrs("href")
Error in html_attrs(., "href") : unused argument ("href")

Can you please help me extract all the URLs from the above link using R?

HTML: https://i.stack.imgur.com/5BiFU.jpg

CodePudding user response：

Base R solution: go back one level to the bucket root URL, then read and parse the XML it returns:

# Store as a variable the path url to be scraped: base_url => character scalar
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Resolve the zip urls: zip_urls => character vector
zip_urls <- paste(
  base_url, 
  gsub(
    ">(.*?)<\\/",
    "\\1",
    grep(
      "\\.zip", 
      strsplit(
        readLines(base_url), 
        "\\<Key\\>")[[2]],
      value = TRUE
    )
  ),
  sep = "/"
)
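
To check the same idea without hitting the network, the extraction can be run on a small inline listing. This is only a sketch: the XML string and the file names (sample_a.zip, sample_b.zip, index.html) are made up for illustration, and it assumes every file name sits in its own <Key>...</Key> element, as in the bucket listing. It uses regmatches() rather than strsplit(), which is a swapped-in variant and not the code above.

```r
# Hypothetical stand-in for the bucket listing; the file names are invented.
listing <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>sample_a.zip</Key></Contents>",
  "<Contents><Key>index.html</Key></Contents>",
  "<Contents><Key>sample_b.zip</Key></Contents>",
  "</ListBucketResult>"
)
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Pull out every <Key>...</Key> element, then strip the surrounding tags.
key_tags <- regmatches(listing, gregexpr("<Key>[^<]*</Key>", listing))[[1]]
keys <- gsub("</?Key>", "", key_tags)

# Keep only the zip files and prepend the bucket URL.
zip_keys <- grep("\\.zip$", keys, value = TRUE)
zip_urls <- paste(base_url, zip_keys, sep = "/")
print(zip_urls)
```

Against the real bucket, listing would be replaced by something like paste(readLines(base_url), collapse = ""), assuming the server still returns the same XML layout.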

CodePudding user response：

The links come from an additional GET request made by the browser, which returns XML. You can still use rvest: grab the Key nodes, then complete the URLs.

library(rvest)

base_url <- "https://divvy-tripdata.s3.amazonaws.com"
files <- read_html(base_url) |>
  html_elements('key') |>
  html_text() |>
  url_absolute(base_url)

For older R versions (before 4.1), swap |> for %>%. rvest re-exports %>%, so library(magrittr) is only needed if the pipe is not already available.
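
If only the .zip files are wanted, the keys can be filtered before the URLs are completed. The sketch below runs offline on a made-up listing (the XML string and file names are hypothetical), so the steps can be checked without the network. It assumes read_html() lower-cases tag names, which is why the selector is 'key' rather than 'Key'.

```r
library(rvest)

base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Hypothetical stand-in for the XML returned by the bucket; names are invented.
listing <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>sample_a.zip</Key></Contents>",
  "<Contents><Key>index.html</Key></Contents>",
  "<Contents><Key>sample_b.zip</Key></Contents>",
  "</ListBucketResult>"
)

# Parse the string, select the key nodes and read their text.
keys <- read_html(listing) |>
  html_elements("key") |>
  html_text()

# Keep only the zip files, then turn them into absolute URLs.
zip_keys <- keys[grepl("\\.zip$", keys)]
zip_urls <- url_absolute(zip_keys, base_url)
print(zip_urls)
```

On the original error: html_attrs() takes only the node set and returns every attribute of each node, whereas html_attr(x, "href") is the form that accepts an attribute name.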