I am having trouble scraping an HTML table that has nested (two-level) column headers.
I tried to do it with rvest, but the result is messy.
library(rvest)
library(tidyverse)  # also attaches dplyr and stringr
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
# Read the page, pick the table and let html_table() parse it
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
immiTable
My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and also to produce a data frame of the whole table and save it to Excel.
Teaching material on scraping and unnesting tables with nested headers seems rather scarce, so I would appreciate your guidance. Thank you very much for your help.
CodePudding user response:
Here is a way. The headers format complicates things but the code below works. It extracts the entire table, not just the first row.
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
page <- url_data %>% read_html()
page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  # Commas are thousands separators (e.g. "4,258"), so remove them
  gsub(",", "", ., fixed = TRUE) %>%
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final
final
#>                                Arrival Hong Kong Residents
#> Airport                                               4258
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          333
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                          3404
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 7995
#>                                Arrival Mainland Visitors Arrival Other Visitors
#> Airport                                             1488                    422
#> Express Rail Link West Kowloon                         0                      0
#> Hung Hom                                               0                      0
#> Lo Wu                                                  0                      0
#> Lok Ma Chau Spur Line                                  0                      0
#> Heung Yuen Wai                                         0                      0
#> Hong Kong-Zhuhai-Macao Bridge                         28                     39
#> Lok Ma Chau                                            0                      0
#> Man Kam To                                             0                      0
#> Sha Tau Kok                                            0                      0
#> Shenzhen Bay                                         348                     37
#> China Ferry Terminal                                   0                      0
#> Harbour Control                                        0                      0
#> Kai Tak Cruise Terminal                                0                      0
#> Macau Ferry Terminal                                   0                      0
#> Total                                               1864                    498
#>                                Arrival Total Departure Hong Kong Residents
#> Airport                                 6168                          3775
#> Express Rail Link West Kowloon             0                             0
#> Hung Hom                                   0                             0
#> Lo Wu                                      0                             0
#> Lok Ma Chau Spur Line                      0                             0
#> Heung Yuen Wai                             0                             0
#> Hong Kong-Zhuhai-Macao Bridge            400                           243
#> Lok Ma Chau                                0                             0
#> Man Kam To                                 0                             0
#> Sha Tau Kok                                0                             0
#> Shenzhen Bay                            3789                          1301
#> China Ferry Terminal                       0                             0
#> Harbour Control                            0                             0
#> Kai Tak Cruise Terminal                    0                             0
#> Macau Ferry Terminal                       0                             0
#> Total                                  10357                          5319
#>                                Departure Mainland Visitors
#> Airport                                               1154
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          194
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                           524
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 1872
#>                                Departure Other Visitors Departure Total
#> Airport                                             315            5244
#> Express Rail Link West Kowloon                        0               0
#> Hung Hom                                              0               0
#> Lo Wu                                                 0               0
#> Lok Ma Chau Spur Line                                 0               0
#> Heung Yuen Wai                                        0               0
#> Hong Kong-Zhuhai-Macao Bridge                        15             452
#> Lok Ma Chau                                           0               0
#> Man Kam To                                            0               0
#> Sha Tau Kok                                           0               0
#> Shenzhen Bay                                         28            1853
#> China Ferry Terminal                                  0               0
#> Harbour Control                                       0               0
#> Kai Tak Cruise Terminal                               0               0
#> Macau Ferry Terminal                                  0               0
#> Total                                               358            7549
Created on 2022-09-18 with reprex v2.0.2
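The question also asks to save the table to Excel, which the code above does not cover. Here is a minimal sketch, assuming the writexl package is installed (it is my choice, not something used above; other Excel packages would work too). To keep it self-contained, `toy_final` is a hypothetical three-row, two-column excerpt with the same shape as `final`, and the file name is made up.

```r
library(writexl)

# Hypothetical stand-in for `final`: a small excerpt with control points as
# row names and spaces in the column names, as in the scraped table.
toy_final <- data.frame(
  `Arrival Other Visitors`   = c(422, 39, 37),
  `Departure Other Visitors` = c(315, 15, 28),
  row.names   = c("Airport", "Hong Kong-Zhuhai-Macao Bridge", "Shenzhen Bay"),
  check.names = FALSE
)

# write_xlsx() does not write row names, so move them into a regular column.
to_export <- cbind(`Control Point` = row.names(toy_final), toy_final)
row.names(to_export) <- NULL

# Hypothetical output location; replace with the path you want.
out_path <- file.path(tempdir(), "immigration_stats.xlsx")
write_xlsx(to_export, path = out_path)
```

With the real data, replacing `toy_final` by `final` should be enough.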
To plot the pie chart, first notice that the Airport values differ widely in scale, so the smallest slices would be hard to see. Therefore, I plot the logarithm of the values. Keep in mind that on a log scale the slice sizes are no longer proportional to the actual counts.
library(ggplot2)
# First row of the table: the Airport control point
Airport <- final[1, ]
Airport %>%
  t() %>%
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", log(Airport), fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", start = 0) +
  theme_void()
Created on 2022-09-18 with reprex v2.0.2
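If true proportions matter more than making every slice visible, an alternative is to plot shares instead of logarithms, and to leave out the Total columns so that nothing is counted twice. This is only a sketch of that variation, not the approach used above; `airport_arrivals` and its counts are hypothetical numbers chosen for illustration.

```r
library(ggplot2)

# Hypothetical counts for one control point (not taken from the page).
airport_arrivals <- data.frame(
  category = c("Hong Kong Residents", "Mainland Visitors", "Other Visitors"),
  count    = c(600, 300, 100)
)

# Convert counts to shares so that the slices add up to the whole pie.
airport_arrivals$share <- airport_arrivals$count / sum(airport_arrivals$count)

pie <- ggplot(airport_arrivals, aes(x = "", y = share, fill = category)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", start = 0) +
  theme_void()
pie
```

The same idea applies to the real Airport row after selecting only the three arrival (or departure) category columns.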