Webscraping images in R and saving them into a zip file

I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas. I want to download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points, and save the images to a zip file.

Edit: Is there a way to do this with just rvest and without using for loops?

I've found out that the image address can be obtained by clicking on the image and selecting "copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png

I've noted that the string of numbers would represent the date and time. So the one I'd need would be 20220731xxxxxxx where x would be the time. However, how would I then use this to webscrape?

Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.

CodePudding user response：

You can use the following code to save screenshots of the webpage:

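library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"

# Start a Selenium server in Docker (shell() is only available on Windows)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
Sys.sleep(5)  # give the container a few seconds to start; it may need longer

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)

# Click the play button to start the radar slideshow
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()

# Take one screenshot per second while the slideshow runs
for(i in 1 : 10)
{
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}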

CodePudding user response：

Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.

1. Load packages

We will begin by loading our packages of interest:

# Load packages ----
pacman::p_load(
  httr,
  png,
  purrr,
  RSelenium,
  rvest,
  servr
)

2. Setup

Now, we need to start the Selenium server with Firefox. The following code will start a Firefox instance on a random free port. Run it and wait for Firefox to launch:

# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())

Now that the Firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:

cl <- rsd$client

The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:

# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")

Let's get scraping

Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.

In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the next step, and so on all the way to the 13th step.

Here I get the HTML element for each step:

# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]

Then, I click on each element and get the image URL at each step. After you run this code, watch how it manipulates the webpage in the Firefox instance. Isn't that cool?

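img_urls <- map_chr(rail_steps, function(step){

  # Move the mouse to the step on the slider and click it
  cl$mouseMoveToLocation(webElement = step)
  cl$click()

  img_el <- cl$findElement(using = "css", value = "#rain_overlay")

  # Give the page a moment to swap in the new radar image
  Sys.sleep(1)

  # Return the image URL for this step
  img_el$getElementAttribute(attrName = "src")[[1]]

})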
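
If the file-name pattern noted in the question holds (a 12-digit date and time stamp followed by 0000), the URLs for a given time window could also be built directly instead of being read from the slider. The following is only a minimal sketch under that assumption: the pattern is inferred from the single example URL in the question, the time zone of the stamp is not confirmed, and the server may no longer hold files for past dates. The names base_url, times, stamps and candidate_urls are made up for this illustration.

```r
# Folder of the example URL given in the question
base_url <- "https://www.nea.gov.sg/docs/default-source/rain-area-240km/"

# 01:00 to 03:00 in 5-minute steps, end points included.
# tz = "UTC" is used only as a neutral container so that no daylight-saving
# shift can occur; the stamps are treated as plain wall-clock labels.
times <- seq(
  from = as.POSIXct("2022-07-31 01:00:00", tz = "UTC"),
  to   = as.POSIXct("2022-07-31 03:00:00", tz = "UTC"),
  by   = "5 min"
)

# Assumed file name: dpsri_240km_<YYYYMMDDHHMM>0000dBR.dpsri.png
stamps <- format(times, "%Y%m%d%H%M")
candidate_urls <- paste0(base_url, "dpsri_240km_", stamps, "0000dBR.dpsri.png")

length(candidate_urls)
```

If those files are still available, candidate_urls could be passed to the download step in place of img_urls. This uses no for loop, since seq(), format() and paste0() are all vectorised.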

Finally, I create an image folder img where I download and save the images:

# Create an image folder then download all images in it ----

dir.create("img")

walk(img_urls, function(img_url){
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
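
The question also asks for the images to end up in a zip file. One possible way to do that, sketched here with base R's utils::zip(), is shown below. It assumes an external zip program is available on the system PATH (on Windows this usually means installing Rtools). The helper name zip_images and the archive name radar_scans.zip are made up for this illustration.

```r
# Collect every PNG in a folder into a single zip archive.
# Assumption: utils::zip() can find an external "zip" program.
zip_images <- function(img_dir = "img", zipfile = "radar_scans.zip") {
  png_files <- list.files(img_dir, pattern = "\\.png$", full.names = TRUE)
  if (length(png_files) == 0) {
    stop("No PNG files found in ", img_dir)
  }
  # utils::zip() returns the exit status of the zip program (0 on success)
  status <- utils::zip(zipfile = zipfile, files = png_files)
  invisible(status)
}

# Example call once the img folder has been filled:
# zip_images("img", "radar_scans.zip")
```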

Important

The downloaded images do not contain the background map shown on the webpage, only the radar points! You can download the background map and then lay the points on top of it (using image-processing software, for example). Here is how to download the background map:

# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
  content() |>
  writePNG(target = "base_image.png")

If you want to combine the images programmatically, you may want to look into the magick package in R.
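
As a rough sketch of what that could look like with magick: the helper below stretches one radar overlay to the pixel size of the base map and composites it on top. It assumes the overlay and the map cover the same geographic extent, which has not been checked here, so the alignment may need adjusting. The function name overlay_on_map and the output file name combined.png are made up for this illustration.

```r
library(magick)

# Lay one (transparent) radar overlay on top of the base map
overlay_on_map <- function(map_img, radar_img) {
  info <- image_info(map_img)
  # Assumption: both images cover the same area, so the overlay is resized
  # to the map's pixel size ("!" tells ImageMagick to ignore aspect ratio)
  geometry <- paste0(info$width[1], "x", info$height[1], "!")
  radar_scaled <- image_resize(radar_img, geometry)
  image_composite(map_img, radar_scaled, operator = "over")
}

# Example with the files downloaded earlier:
# base_map <- image_read("base_image.png")
# radar    <- image_read(list.files("img", full.names = TRUE)[1])
# image_write(overlay_on_map(base_map, radar), path = "combined.png")
```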