Import HTML tables from similar URLs using R Studio

Time:10-08

I'm new to R Studio and biting off more than I can chew :)

I'm trying to use R Studio to import HTML tables from multiple web pages. There are thousands of tables, each with a unique URL, but the URLs all follow the same logic, only varying by location search string, year, and month. Below are a few examples:

https://sunrise-sunset.org/search?location=mar del plata&year=2021&month=10#calendar

https://sunrise-sunset.org/search?location=bendigo victoria&year=1969&month=7#calendar

https://sunrise-sunset.org/search?location=parkville missouri usa&year=2025&month=2#calendar

I've tried using paste0() and c() to write a series [of URLs] that I can then use to import the data to R Studio:

URLs <- paste0("https://sunrise-sunset.org/search?location=parkville missouri usa&year=",c(1969:2031),"&month=",c(1:12),"#calendar")

However, by using two separate instances of c() for Year and Month, the sequences generated are independent of one another and I end up with 63 different URLs instead of 756. Is there a way to generate a series of Months, 1:12, for each year in a series of Years, 1969:2031, using paste0()? Is this even the best approach for what I'm trying to accomplish? And if so, is there also a way to generate this series of Years and Months for multiple locations as well?

Answer:

One option is to use expand.grid() to build a data frame containing every combination of the URL parts, then apply() to paste each row into a single string.

base_url <- 'https://sunrise-sunset.org/search?location=parkville missouri usa&'

year_url <- paste0("year=",c(1969:2031))

mon_url <- paste0("&month=",c(1:12),"#calendar")

out_url <- apply(expand.grid(base_url, year_url, mon_url), 1, paste, collapse = '')

length(out_url)
#> [1] 756

head(out_url)
#> [1] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1969&month=1#calendar"
#> [2] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1970&month=1#calendar"
#> [3] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1971&month=1#calendar"
#> [4] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1972&month=1#calendar"
#> [5] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1973&month=1#calendar"
#> [6] "https://sunrise-sunset.org/search?location=parkville missouri usa&year=1974&month=1#calendar"

Created on 2021-10-07 by the reprex package (v2.0.0)
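The same expand.grid() idea can be extended to several locations, which the question also asks about. Below is a minimal sketch, not a tested scraper: the locations vector is hypothetical (taken from the example URLs in the question), and percent-encoding the spaces with URLencode() is an assumption about what the site or your HTTP client expects, since spaces are not valid in a URL.

```r
# Hypothetical list of locations, taken from the example URLs in the question.
locations <- c("parkville missouri usa", "mar del plata", "bendigo victoria")

# Assumption: percent-encode spaces so each location is safe inside a URL.
encoded <- vapply(locations, utils::URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)

# One row per (month, year, location) combination; month varies fastest.
grid <- expand.grid(
  month    = 1:12,
  year     = 1969:2031,
  location = encoded,
  stringsAsFactors = FALSE
)

# sprintf() is vectorised, so this builds one URL per row of the grid.
urls <- sprintf(
  "https://sunrise-sunset.org/search?location=%s&year=%d&month=%d#calendar",
  grid$location, grid$year, grid$month
)

length(urls)
#> [1] 2268

urls[1]
#> [1] "https://sunrise-sunset.org/search?location=parkville%20missouri%20usa&year=1969&month=1#calendar"
```

Keeping the grid as a data frame also means the location, year, and month for each URL stay available as columns, which may help when labelling the imported tables later.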


Another option uses rep() with each = to repeat every element of mon_url once per year; paste0() then recycles the shorter vectors (base_url and year_url) across the months:

paste0(rep(base_url, each = length(year_url)), 
       year_url, 
       rep(mon_url, each = length(year_url)))
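
To see why the original attempt fell short and why the rep() version works, here is a small sketch of paste0()'s recycling behaviour. The variable names naive and rep_url are only for illustration. Because paste0() recycles shorter arguments to the length of the longest one, base_url does not strictly need its own rep() call.

```r
base_url <- "https://sunrise-sunset.org/search?location=parkville missouri usa&"
year_url <- paste0("year=", 1969:2031)
mon_url  <- paste0("&month=", 1:12, "#calendar")

# Original attempt: years and months are paired element by element,
# with the 12 months recycled along the 63 years.
naive <- paste0(base_url, year_url, mon_url)
length(naive)
#> [1] 63

# rep(..., each = ) stretches the months to 63 * 12 elements, so paste0()
# recycles the years (and the single base_url) once per month instead.
rep_url <- paste0(base_url, year_url, rep(mon_url, each = length(year_url)))
length(rep_url)
#> [1] 756

# Every year/month combination appears exactly once.
anyDuplicated(rep_url)
#> [1] 0
```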