I would like to use RStudio to download the most recent archive (meteorological data) that has been added to this website:
https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/
Do you know how to do it? I am able to download one particular archive, but then I have to write its exact file name, which has to be changed manually every time. I do not want that; I want the latest file to be detected automatically.
Thanks.
CodePudding user response:
The function download_CDC() downloads the files for you. Passing 1 as the input downloads the latest one, saved under the name the website gives it.
library(tidyverse)
library(rvest)

base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"

# Collect every link on the listing page, keep only the archives, and sort
# them newest first (assumes names like RW-YYYYMMDD.tar.gz, so sorting by
# name also sorts by date)
files <- base_url %>%
  read_html() %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("\\.tar\\.gz$") %>%
  sort(decreasing = TRUE)

# Download the item_number-th newest archive into the working directory
download_CDC <- function(item_number) {
  download.file(paste0(base_url, files[item_number]),
                destfile = files[item_number],
                mode = "wb")
}

download_CDC(1)
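Picking a file by its position depends on how the listing page happens to be ordered. As a rough alternative, the date embedded in each file name can be parsed and compared directly. This is only a sketch: newest_archive() is a hypothetical helper, and it assumes the archives are named RW-YYYYMMDD.tar.gz, as the files in this listing appear to be.

```r
# Hypothetical helper: pick the newest archive from a vector of hrefs.
# Assumes names look like "RW-YYYYMMDD.tar.gz" (an assumption based on the listing).
newest_archive <- function(hrefs) {
  pattern <- "^RW-([0-9]{8})\\.tar\\.gz$"
  # Keep only archive links; this drops entries such as "../"
  archives <- grep(pattern, hrefs, value = TRUE)
  if (length(archives) == 0) {
    stop("no RW-YYYYMMDD.tar.gz links found")
  }
  # Parse the date embedded in each name and take the most recent one
  dates <- as.Date(sub(pattern, "\\1", archives), format = "%Y%m%d")
  archives[which.max(as.numeric(dates))]
}

newest_archive(c("../", "RW-20220718.tar.gz", "RW-20220720.tar.gz", "RW-20220719.tar.gz"))
#> [1] "RW-20220720.tar.gz"
```

The result can then be pasted onto the base URL and handed to download.file(), in the same way as in the function above.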
CodePudding user response:
It's a bit naïve (no error checking, and it blindly takes the last link from the file list page), but it works with that particular listing.
Most web scraping in R happens through rvest. html_element("a:last-of-type") uses a CSS selector to extract the last element of type <a>, which is your last archive, and html_attr('href') extracts the href attribute from that last <a> element, which is the actual link to the file.
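library(rvest)

# Build the full URL of the last link on a directory listing page
last_link <- function(url) {
  last_href <- read_html(url) |>
    html_element("a:last-of-type") |>  # last <a> element on the page
    html_attr('href')                  # its relative link (the file name)
  paste0(url, last_href)
}

url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)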
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"
Created on 2022-07-21 by the reprex package (v2.0.1)
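Since the snippet above does no error checking, here is one way some could be added. It is a minimal sketch, not a definitive implementation: last_archive_href() and download_last() are hypothetical helpers, and the check assumes that every archive link ends in ".tar.gz". The small HTML page at the end is made up so that the link extraction can be tried offline.

```r
library(rvest)

# Hypothetical helper: href of the last archive link in a parsed page.
# Assumes archive links end in ".tar.gz"; returns NA if there is none.
last_archive_href <- function(page) {
  hrefs <- page |>
    html_elements("a") |>
    html_attr("href")
  archives <- hrefs[!is.na(hrefs) & endsWith(hrefs, ".tar.gz")]
  if (length(archives) == 0) NA_character_ else archives[length(archives)]
}

# Hypothetical wrapper: download the last archive listed at `url` into `dest_dir`
download_last <- function(url, dest_dir = ".") {
  href <- last_archive_href(read_html(url))
  if (is.na(href)) stop("no .tar.gz link found at ", url)
  dest <- file.path(dest_dir, basename(href))
  # download.file() returns 0 on success
  status <- download.file(paste0(url, href), destfile = dest, mode = "wb")
  if (status != 0) stop("download failed with status ", status)
  dest
}

# Offline check of the link extraction on a made-up listing
page <- minimal_html('<pre><a href="../">../</a>
<a href="RW-20220719.tar.gz">RW-20220719.tar.gz</a>
<a href="RW-20220720.tar.gz">RW-20220720.tar.gz</a></pre>')
last_archive_href(page)
#> [1] "RW-20220720.tar.gz"
```

Calling download_last(url) with the listing URL from above would then fetch the file and return the local path; that step needs network access.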