How to scrape HTML table with nested column with Rvest?


I have run into a problem scraping an HTML table with nested columns.

The table is on the page at the URL in the code below.

I tried to do it with rvest, but the result is messy.


library(rvest)

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()


My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and to produce a data frame of the whole table and save it to Excel.

I realize that teaching material on unnesting tables and scraping nested tables is rather scarce, so I need your guidance. Thank you very much for your help.

CodePudding user response:

Here is a way. The format of the headers complicates things, but the code below works. It extracts the entire table, not just the first row.


library(rvest)

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

page <- url_data %>% read_html()

# Sub-headers under "Arrival", prefixed with the parent header
page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
# Sub-headers under "Departure", appended to the arrival names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names
# Control point names become the row names
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names
# Data cells: the commas are thousands separators, so remove them
# before converting to numeric, then reshape row by row
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  gsub(",", "", ., fixed = TRUE) %>%
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final

final
#>                                Arrival Hong Kong Residents
#> Airport                                               4258
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          333
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                          3404
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 7995
#>                                Arrival Mainland Visitors Arrival Other Visitors
#> Airport                                             1488                    422
#> Express Rail Link West Kowloon                         0                      0
#> Hung Hom                                               0                      0
#> Lo Wu                                                  0                      0
#> Lok Ma Chau Spur Line                                  0                      0
#> Heung Yuen Wai                                         0                      0
#> Hong Kong-Zhuhai-Macao Bridge                         28                     39
#> Lok Ma Chau                                            0                      0
#> Man Kam To                                             0                      0
#> Sha Tau Kok                                            0                      0
#> Shenzhen Bay                                         348                     37
#> China Ferry Terminal                                   0                      0
#> Harbour Control                                        0                      0
#> Kai Tak Cruise Terminal                                0                      0
#> Macau Ferry Terminal                                   0                      0
#> Total                                               1864                    498
#>                                Arrival Total Departure Hong Kong Residents
#> Airport                                 6168                          3775
#> Express Rail Link West Kowloon             0                             0
#> Hung Hom                                   0                             0
#> Lo Wu                                      0                             0
#> Lok Ma Chau Spur Line                      0                             0
#> Heung Yuen Wai                             0                             0
#> Hong Kong-Zhuhai-Macao Bridge            400                           243
#> Lok Ma Chau                                0                             0
#> Man Kam To                                 0                             0
#> Sha Tau Kok                                0                             0
#> Shenzhen Bay                            3789                          1301
#> China Ferry Terminal                       0                             0
#> Harbour Control                            0                             0
#> Kai Tak Cruise Terminal                    0                             0
#> Macau Ferry Terminal                       0                             0
#> Total                                  10357                          5319
#>                                Departure Mainland Visitors
#> Airport                                               1154
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          194
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                           524
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 1872
#>                                Departure Other Visitors Departure Total
#> Airport                                             315            5244
#> Express Rail Link West Kowloon                        0               0
#> Hung Hom                                              0               0
#> Lo Wu                                                 0               0
#> Lok Ma Chau Spur Line                                 0               0
#> Heung Yuen Wai                                        0               0
#> Hong Kong-Zhuhai-Macao Bridge                        15             452
#> Lok Ma Chau                                           0               0
#> Man Kam To                                            0               0
#> Sha Tau Kok                                           0               0
#> Shenzhen Bay                                         28            1853
#> China Ferry Terminal                                  0               0
#> Harbour Control                                       0               0
#> Kai Tak Cruise Terminal                               0               0
#> Macau Ferry Terminal                                  0               0
#> Total                                               358            7549

Created on 2022-09-18 with reprex v2.0.2
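
The selector logic can also be tried offline on a small made-up table. This is a minimal sketch, assuming rvest 1.0.0 or later (for `minimal_html()` and `html_elements()`). The HTML and its numbers are invented for illustration; they only mimic the `headers` attributes and the `hRight` class that the code above relies on.

```r
library(rvest)

# Invented two-level header table (NOT the real page): the sub-header cells
# point at their parent through `headers`, and data cells use class "hRight".
html <- minimal_html('
  <table>
    <tr><th id="Control_Point" rowspan="2">Control Point</th>
        <th id="Arrival" colspan="2">Arrival</th>
        <th id="Departure" colspan="2">Departure</th></tr>
    <tr><th headers="Arrival">Residents</th><th headers="Arrival">Visitors</th>
        <th headers="Departure">Residents</th><th headers="Departure">Visitors</th></tr>
    <tr><td headers="Control_Point">Airport</td>
        <td class="hRight">4,258</td><td class="hRight">1,910</td>
        <td class="hRight">3,775</td><td class="hRight">1,469</td></tr>
    <tr><td headers="Control_Point">Lo Wu</td>
        <td class="hRight">0</td><td class="hRight">0</td>
        <td class="hRight">0</td><td class="hRight">0</td></tr>
  </table>')

# Attribute selectors pick the sub-headers that belong to each parent header
arr <- html %>% html_elements("[headers='Arrival']") %>% html_text()
dep <- html %>% html_elements("[headers='Departure']") %>% html_text()
col_names <- c(paste("Arrival", arr), paste("Departure", dep))

row_names <- html %>% html_elements("[headers='Control_Point']") %>% html_text()

# Data cells come back in document order (row by row); strip the
# thousands separators before converting to numeric
values <- html %>% html_elements(".hRight") %>% html_text()
values <- as.numeric(gsub(",", "", values, fixed = TRUE))

toy <- as.data.frame(matrix(values, nrow = length(row_names), byrow = TRUE))
names(toy) <- col_names
rownames(toy) <- row_names
```

Because the cells are read in document order, `byrow = TRUE` is what puts each value back under the right flattened header.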

To plot the pie chart, first notice the difference in scale between the numbers: some are many times larger than others. Therefore, I plot the logarithm of the values, which means the slices are proportional to the logs, not to the raw counts.


library(dplyr)
library(ggplot2)

# First row of the table: the Airport control point
Airport <- final[1, ]
Airport %>%
  t() %>%
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", log(Airport), fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", start = 0)

Created on 2022-09-18 with reprex v2.0.2
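
The question also asks for the table to be saved to Excel. Here is a minimal sketch of one way to do that, assuming the writexl package (not used elsewhere in this answer). `final_demo` is a hypothetical stand-in for `final` with placeholder figures. As far as I know, `write_xlsx()` does not write row names, so they are moved into an ordinary column first.

```r
# Stand-in for `final`: control points as row names, placeholder numbers
final_demo <- data.frame(
  `Arrival Total` = c(100, 0),
  `Departure Total` = c(80, 0),
  row.names = c("Airport", "Lo Wu"),
  check.names = FALSE
)

# Move the row names into a regular column so they end up in the sheet
out <- cbind(Control_Point = rownames(final_demo), final_demo)
rownames(out) <- NULL

# Write the workbook only if writexl is installed; a temporary path is
# used here, replace it with the file name you want
xlsx_path <- tempfile(fileext = ".xlsx")
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(out, path = xlsx_path)
}
```

With the real data, replace `final_demo` with `final`.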
