Unable to access directory of HTML site using R (RCurl package)

Time:12-17

I am trying to access the following HTTP directory of weather data using the RCurl package in R:

http://ncei.noaa.gov/data/global-summary-of-the-day/access/

There is one directory per year, and each contains its own list of weather station files.

I can access any specific dataset like this:

url = 'http://ncei.noaa.gov/data/global-summary-of-the-day/access/1932/03005099999.csv'
data = read.csv(url)

However, I can't automate this process without knowing which files are in each directory. I've tried using the RCurl package to get a list of all the files in a directory, but I always get errors:

library(RCurl)

url = 'http://ncei.noaa.gov/data/global-summary-of-the-day/access/'
getURL(url)

This gives me the following output, saying the address has moved (to an https address):

[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href=\"https://ncei.noaa.gov/data/global-summary-of-the-day/access/\">here</a>.</p>\n</body></html>\n"

Changing the address to the https URL indicated gives this error:

url = 'https://ncei.noaa.gov/data/global-summary-of-the-day/access/'
getURL(url)

Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

I also tried replacing https: with ftps: and ftp:, but both give me a timeout error.

Any thoughts on how to get the directory listing?

Answer:

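I think the issue here is that the server only supports requests using TLS version 1.2, and the SSL library your RCurl installation is built against does not support it. The "tlsv1 alert protocol version" part of the error is consistent with that: the server rejected the protocol version the client offered.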
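
One way to check this, as a rough sketch assuming RCurl is installed, is to look at which libcurl version and SSL backend RCurl is linked against. An old backend (for example an OpenSSL release older than 1.0.1) would be expected to lack TLS 1.2 support:

```r
library(RCurl)

# curlVersion() reports details about the libcurl build RCurl is using.
info <- curlVersion()

info$version      # libcurl version string
info$ssl_version  # SSL/TLS library libcurl was built against
```

If the reported SSL library is outdated, updating R and RCurl (or the system libcurl) may be enough to make getURL() work with the https address.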

You might be able to achieve what you want using httr and rvest. For example, to get a tibble listing the files in the 1929 directory:

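library(httr)
library(rvest)  # also provides the %>% pipe

url1 <- "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929"
page_data <- GET(url1)

# Parse the HTML and extract its tables; the directory listing is the first one
files <- content(page_data, as = "parsed") %>%
  html_table() %>%
  .[[1]]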

files

# A tibble: 24 x 4
   Name               `Last modified`    Size  Description
   <chr>              <chr>              <chr> <lgl>      
 1 ""                 ""                 ""    NA         
 2 "Parent Directory" ""                 "-"   NA         
 3 "03005099999.csv"  "2019-01-19 12:37" "20K" NA         
 4 "03075099999.csv"  "2019-01-19 12:37" "20K" NA         
 5 "03091099999.csv"  "2019-01-19 12:37" "17K" NA         
 6 "03159099999.csv"  "2019-01-19 12:37" "20K" NA         
 7 "03262099999.csv"  "2019-01-19 12:37" "20K" NA         
 8 "03311099999.csv"  "2019-01-19 12:37" "19K" NA         
 9 "03379099999.csv"  "2019-01-19 12:37" "33K" NA         
10 "03396099999.csv"  "2019-01-19 12:37" "21K" NA         
# ... with 14 more rows
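
To automate the downloads from here, the listing still needs to be reduced to the actual CSV file names and joined to the directory URL. The following is a minimal sketch of that step; csv_urls is a hypothetical helper name, and the small data frame stands in for the files table above so the example runs without a network connection:

```r
# Hypothetical helper: turn a directory-listing table into full CSV URLs.
csv_urls <- function(listing, base_url) {
  # Keep only names ending in .csv; this drops the blank row and the
  # "Parent Directory" entry.
  csv_names <- listing$Name[grepl("\\.csv$", listing$Name)]
  # Strip any trailing slash so there is exactly one between URL and name.
  paste0(sub("/+$", "", base_url), "/", csv_names)
}

# Stand-in for the table returned by html_table().
listing <- data.frame(
  Name = c("", "Parent Directory", "03005099999.csv", "03075099999.csv"),
  stringsAsFactors = FALSE
)

urls <- csv_urls(
  listing,
  "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/1929"
)
urls
```

With the real table, csv_urls(files, url1) should give one URL per station file, and something like lapply(urls, read.csv) could then read them in turn, assuming read.csv can reach the https address on your system.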