This time I am trying to scrape the same type of data: airport travel figures over an arbitrary date range. My goal is to build a table of airport traffic and plot a line chart of how the figures change over that interval, but I am having trouble with the iteration.
My code is as follows:
library(rvest)
library(dplyr)
library(tidyverse)
start <- as.Date("01-09-22", format = "%d-%m-%y")
end <- as.Date("30-09-22", format = "%d-%m-%y")
prefixes <- c("arr", "dep")
cols <-
c("Hong Kong Residents",
"Mainland Visitors",
"Other Visitors",
"Total")
headers <-
c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
theDate <- start
while (theDate <= end)
{
  url_data <-
    print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
    ))
  rows <-
    read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
  df <- map_dfr(rows,
                function(x) {
                  x %>%
                    html_elements("td[headers]") %>%
                    set_names(headers) %>%
                    html_text()
                }) %>%
    filter(Control_Point %in% c("Airport")) %>% #select only airport data
    mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
    mutate(date = theDate - 1) %>%
    write.csv(df, "immigrationStatistics.csv")
  theDate <- theDate + 1
}
view(df)
Why and where does the error occur, and how can I fix the loop? The console reports:
[1] "https://www.immd.gov.hk/eng/stat_20220901.html"
Error in file == "" :
comparison (1) is possible only for atomic and list types
> view(df)
Error in checkHT(n, dim(x)) :
invalid 'n' - must contain at least one non-missing element, got none.
Thanks a million in advance.
Answer:
I was unable to reproduce your error. However, I did make one change: the result of each loop iteration is collected into a list, and the combined data is written to a file just once at the end. Your original code appears to overwrite the data file on every iteration.
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr) # provides crossing() and unite(), used to build the headers
start <- as.Date("01-09-22", format = "%d-%m-%y")
end <- as.Date("3-09-22", format = "%d-%m-%y")
prefixes <- c("arr", "dep")
cols <-
c("Hong Kong Residents",
"Mainland Visitors",
"Other Visitors",
"Total")
headers <-
c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
answer <- list()
theDate <- start
while (theDate <= end) {
  # Build and print the URL for the current date
  url_data <- paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html")
  print(url_data)
  rows <-
    read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
  df <- map_dfr(rows,
                function(x) {
                  x %>%
                    html_elements("td[headers]") %>%
                    set_names(headers) %>%
                    html_text()
                }) %>%
    filter(Control_Point %in% c("Airport")) %>% # select only airport data
    mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
    mutate(date = theDate - 1)
  # Store this day's result under its date string
  answer[[as.character(theDate)]] <- df
  theDate <- theDate + 1
  Sys.sleep(1) # brief pause between requests
}
#bind_rows(answer)
write.csv(bind_rows(answer), "immigrationStatistics.csv")
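The collect-then-write pattern used in the loop can be tried offline. The following is only a minimal sketch: `fake_scrape()` is a hypothetical stand-in for the real page request, its numbers are made up, and keying the list by a date string is an illustrative choice.

```r
library(dplyr)

# Hypothetical stand-in for one day's scrape: returns a one-row data frame
# with made-up numbers (not real immigration statistics).
fake_scrape <- function(day) {
  data.frame(
    Control_Point = "Airport",
    arr_Total = 100L + as.integer(format(day, "%d"))
  )
}

dates <- seq(as.Date("2022-09-01"), as.Date("2022-09-03"), by = "day")

results <- list()
for (i in seq_along(dates)) {
  d <- dates[i]                              # indexing keeps the Date class
  results[[as.character(d)]] <- fake_scrape(d)  # one list entry per day
}

# Stack all days into one data frame; the list names become a `date` column
combined <- bind_rows(results, .id = "date")
print(combined)
```

Because the list is only bound together after the loop, the CSV needs to be written once, and no iteration can overwrite an earlier day's rows.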
The final change was to add a short pause between requests so that the scraping does not look like an attack on the server.
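As for the error itself, one likely cause (offered as an assumption, since it was not reproduced above) is the trailing `%>%` before `write.csv()` in the question's code. The pipe passes the data frame as the first argument, so `df` lands in the `file` position. The sketch below uses a toy data frame and a temporary file, both hypothetical, to show the effect without any network access.

```r
library(dplyr)

toy <- data.frame(x = 1:3)
path <- tempfile(fileext = ".csv")

# The trailing %>% turns the last call into write.csv(<piped data>, df, path),
# so `df` is taken as the `file` argument.
# Assumption: no object named `df` exists yet in the session, so `df`
# resolves to the function stats::df, which cannot be compared with "".
res <- tryCatch(
  {
    toy %>%
      mutate(y = x * 2) %>%
      write.csv(df, path)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(res)

# Ending the pipeline first and writing in a separate statement avoids this.
fixed <- toy %>% mutate(y = x * 2)
write.csv(fixed, path, row.names = FALSE)
```

This would also account for the second message in the question: since the assignment to `df` never completed, `view(df)` was probably handed the function `stats::df` rather than a data frame.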