Trouble scraping HTML table data over a date interval with rvest

Time:10-02

This time I am trying to scrape the same type of data as two weeks ago: airport travel figures for an arbitrary date range. My goal is to build a table of airport traffic and plot a line chart of the population change over that interval. However, I am having trouble with the iteration.

My code is as follows:

library(rvest)
library(dplyr)
library(tidyverse)


start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("30-09-22", format = "%d-%m-%y")


prefixes <- c("arr", "dep")
cols <-
  c("Hong Kong Residents",
    "Mainland Visitors",
    "Other Visitors",
    "Total")
headers <-
  c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())


theDate <- start
while (theDate <= end)
{
  url_data <-
    print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
    ))
  
  rows <-
    read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
 
  df <- map_dfr(rows,
                function(x) {
                  x %>%
                    html_elements("td[headers]") %>%
                    set_names(headers) %>%
                    html_text()
                }) %>%
    filter(Control_Point %in% c("Airport")) %>% #select only airport data
    mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
    mutate(date = theDate - 1) %>%
    write.csv(df, "immigrationStatistics.csv")
  
  theDate <- theDate + 1
}
view(df)

May I know where and why the error occurs, and how to fix the iteration? The console reports:

[1] "https://www.immd.gov.hk/eng/stat_20220901.html"
Error in file == "" : 
  comparison (1) is possible only for atomic and list types
> view(df)
Error in checkHT(n, dim(x)) : 
  invalid 'n' -  must contain at least one non-missing element, got none.

Thanks a million in advance.

CodePudding user response:

I was unable to reproduce your error. However, I did make one change: the result of each loop iteration is collected into a list, and the data is written to a file just once at the end. It looks like your original code would overwrite the data file on each iteration.

library(rvest)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)   # provides crossing() and unite(), used to build the headers

start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("3-09-22", format = "%d-%m-%y")

prefixes <- c("arr", "dep")
cols <-
   c("Hong Kong Residents",
     "Mainland Visitors",
     "Other Visitors",
     "Total")
headers <-
   c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

answer <- list()
theDate <- start
while (theDate <= end) {
   url_data <-
      print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
      ))
   
   rows <-
      read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
   
   df <- map_dfr(rows,
                 function(x) {
                    x %>%
                       html_elements("td[headers]") %>%
                       set_names(headers) %>%
                       html_text()
                 })  %>%
      filter(Control_Point %in% c("Airport")) %>% #select only airport data
      # str_replace_all() strips every thousands separator, not only the first
      mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer())) %>%
      mutate(date = theDate - 1)         
   # key by the date as text; a Date index is numeric and would pad the list with NULLs
   answer[[format(theDate)]] <- df
      
   theDate <- theDate + 1
   Sys.sleep(1)
}
#bind_rows(answer)
write.csv(bind_rows(answer), "immigrationStatistics.csv")
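
The while loop above could also be driven by a precomputed date sequence. The following is only a sketch under that assumption: it builds the same URL pattern for a three-day range with base R and does not download anything. The names `dates` and `urls` are introduced here for illustration and are not part of the original code.

```r
start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("03-09-22", format = "%d-%m-%y")

# One Date per day in the closed interval [start, end]
dates <- seq(start, end, by = "day")

# Same URL pattern as in the loop above, built for all dates at once
urls <- paste0("https://www.immd.gov.hk/eng/stat_",
               format(dates, "%Y%m%d"), ".html")

# Name each URL by its ISO date so that per-day results can be stored
# under readable character keys
names(urls) <- format(dates)

print(urls)
```

Iterating over `names(urls)` rather than over `dates` directly avoids a common pitfall: a `for` loop over a Date vector yields plain numbers, not Date objects.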

The final change was to add a short pause between requests, so that the scraping does not look like an attack on the server.
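
As for the reported error itself, one plausible cause (not confirmed against the original session) is the last step of the original pipeline, which pipes the result into `write.csv(df, "immigrationStatistics.csv")`. The pipe already supplies the data as the first argument, so `df` lands in the `file` argument. Because `df` has not been assigned yet on the first iteration, the name resolves to the F-distribution density function `stats::df`, and comparing a function with `""` fails. The sketch below reproduces this in base R with a small stand-in data frame; `demo` and `res` are hypothetical names, and the data is made up.

```r
# Hypothetical stand-in for the scraped table (not real figures)
demo <- data.frame(Control_Point = "Airport", arr_Total = 1234L)

# With no user variable called `df`, the name refers to stats::df
print(is.function(df))

# x %>% write.csv(df, "file.csv") is evaluated as write.csv(x, df, "file.csv"),
# so the function `df` is passed as the `file` argument.
res <- tryCatch(
  write.csv(demo, df, "immigrationStatistics.csv"),
  error = function(e) conditionMessage(e)
)

# Prints the captured error message instead of stopping the script
print(res)
```

Ending the pipeline at the last `mutate()` and calling `write.csv()` as a separate statement, as in the code above, avoids the problem.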
