How do you download the most recently added file from a directory on the internet?

I would like to download the latest archive (meteorological data) that has been added to this website, using RStudio:

https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/

Do you know how to do it? I am able to download one particular archive, but then I have to write the exact file name and change it manually every time. I do not want that; I want the latest file to be detected automatically.

Thanks.

CodePudding user response：

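The function download_CDC() downloads a file for you and saves it under the name given on the website. Passing 1 downloads the first file in the listing; since the listing appears to be sorted by name, the latest archive would be the last entry, so pass length(files) to get it.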

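library(tidyverse)  # provides the %>% pipe
library(rvest)

base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"

# Collect the href of every link that directly follows another link,
# which skips the first link on the page (typically the parent directory).
files <- base_url %>%
  read_html() %>%
  html_elements("a + a") %>%
  html_attr("href")

# Download the file at position item_number in the listing,
# saving it in the working directory under its original name.
download_CDC <- function(item_number) {
  download.file(paste0(base_url, files[item_number]),
                destfile = files[item_number],
                mode = "wb")
}

# First entry; download_CDC(length(files)) fetches the last one instead.
download_CDC(1)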
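
Rather than relying on the position of a file in the listing, the newest archive can also be picked by the date embedded in its name. The following is a minimal sketch, assuming the archives follow the RW-YYYYMMDD.tar.gz naming used in this directory. The inline listing and the helper name newest_archive() are made up for illustration, so the example runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the directory listing page (not real data).
listing <- minimal_html('
  <pre>
  <a href="../">../</a>
  <a href="RW-20220718.tar.gz">RW-20220718.tar.gz</a>
  <a href="RW-20220720.tar.gz">RW-20220720.tar.gz</a>
  <a href="RW-20220719.tar.gz">RW-20220719.tar.gz</a>
  </pre>')

# Hypothetical helper: return the href whose embedded date is the most recent.
# The default pattern is an assumption about how the archives are named.
newest_archive <- function(page, pattern = "^RW-([0-9]{8})\\.tar\\.gz$") {
  hrefs <- html_attr(html_elements(page, "a"), "href")
  # Keep only links that look like dated archives.
  hrefs <- hrefs[!is.na(hrefs) & grepl(pattern, hrefs)]
  if (length(hrefs) == 0) stop("no archive links matched the pattern")
  # Parse the YYYYMMDD part of each name and pick the latest date.
  dates <- as.Date(sub(pattern, "\\1", hrefs), format = "%Y%m%d")
  hrefs[which.max(as.numeric(dates))]
}

newest_archive(listing)
```

With the real page, read_html(base_url) would take the place of listing, and the returned name can be appended to base_url and passed to download.file().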

CodePudding user response：

It's a bit naïve (no error checking, blindly takes the last link from the file listing page), but it works with that particular listing.

Most web scraping in R happens through rvest. html_element("a:last-of-type") extracts the last element of type <a> through a CSS selector, which is your latest archive, and html_attr('href') extracts the href attribute from that last <a> element, which is the actual link to the file.

library(rvest)

last_link <- function(url) {
  last_href <- read_html(url) |> 
          html_element("a:last-of-type") |> 
          html_attr('href')
  paste0(url,last_href)
}

url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"

Created on 2022-07-21 by the reprex package (v2.0.1)
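
The missing error checking could be added along the following lines. This is only a sketch and has not been run against the live server: fetch_last_link() and its dest_dir argument are hypothetical names, and it still assumes that the newest archive is the last link on the page.

```r
library(rvest)

# Hypothetical helper: download the target of the last link on a listing page,
# failing with a clear message instead of building a broken URL.
fetch_last_link <- function(url, dest_dir = tempdir()) {
  page <- tryCatch(
    read_html(url),
    error = function(e) stop("could not read listing: ", conditionMessage(e))
  )
  hrefs <- html_attr(html_elements(page, "a"), "href")
  # Drop anchors without an href and the parent-directory link.
  hrefs <- hrefs[!is.na(hrefs) & hrefs != "../"]
  if (length(hrefs) == 0) stop("no file links found at ", url)

  last_href <- hrefs[length(hrefs)]
  file_url <- xml2::url_absolute(last_href, url)  # resolve the relative href
  dest <- file.path(dest_dir, basename(last_href))

  # Skip the download when the archive is already present locally.
  if (!file.exists(dest)) {
    download.file(file_url, destfile = dest, mode = "wb")
  }
  dest
}

# Example call (needs network access, so it is left commented out):
# fetch_last_link("https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/")
```

The function returns the local path of the archive, so the result can be handed straight to untar() or a similar function.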