I am an infectious diseases physician and have set myself the challenge of creating a dataframe of the UK's cumulative published monkeypox cases, so that I can graph it as a running tally or a choropleth map, as there is no nice dashboard for this at present.

All the data is published as HTML web pages rather than as a nice CSV, so I am trying to scrape it using the rvest package.

Data is only published intermittently (about twice per week), with the cumulative totals for each of the 4 home nations in the UK.

I have managed to get working code that pulls data from each of the separate web pages. Testing it on the first 2 pages in my mpx_gov_uk_pages list works well and gives a small example tibble:


# packages used throughout
library(tidyverse)  # dplyr, tidyr, stringr, purrr
library(rvest)      # read_html(), html_nodes(), html_attr(), html_table()
library(lubridate)  # dmy()

# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- "https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview"

# extract urls for each date page (links ending in e.g. 2-august-2022)
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>% 
  read_html() %>% 
  html_nodes(".govuk-link") %>%
  html_attr("href") %>% 
  str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>% 
  paste0("https://www.gov.uk", .)

# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>% 
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))

table2 <- mpx_gov_uk_pages[2] %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>% 
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))

#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].

# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#>   date       area             cases
#>   <date>     <chr>            <dbl>
#> 1 2022-08-02 England           2638
#> 2 2022-08-02 Northern Ireland    24
#> 3 2022-08-02 Scotland            65
#> 4 2022-08-02 Wales               32
#> 5 2022-07-29 England           2436
#> 6 2022-07-29 Northern Ireland    19
#> 7 2022-07-29 Scotland            61
#> 8 2022-07-29 Wales               30
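The two string-handling steps in the pipeline above can be tried in isolation, without touching the network. The sketch below is illustrative only: the URL is an assumed example in the style of the gov.uk date pages, and the raw count strings are made up. It shows how the regex pulls the date slug out of a URL for dmy(), and what happens to a count that has no bracketed part.

```r
library(stringr)
library(lubridate)

# Assumed example URL, in the style of the date pages (not scraped here)
url <- "https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview/monkeypox-outbreak-epidemiological-overview-2-august-2022"

# Day, lower-case month name and 4-digit year, joined by hyphens
slug <- str_extract(url, "\\d{1,2}-[a-z]+-\\d{4}")

# dmy() parses the month name; this assumes an English LC_TIME locale
page_date <- dmy(slug)

# Made-up raw counts: one with a bracketed suffix, one without
raw_cases <- c("2,638 (95.0%)", "24")

# Base-R equivalent of the separate() + gsub() steps: drop everything from
# " (" onwards, then strip thousands separators before converting
cases <- as.numeric(gsub(",", "", sub("\\s\\(.*$", "", raw_cases)))
```

Here slug should be "2-august-2022" and cases should be 2638 and 24. The second string has no " (" to split on, which is the situation that makes separate() report "Expected 2 pieces"; passing fill = "right" to separate() should silence that warning.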

I want to automate this by creating a generic function and passing the list of urls to purrr::map_df, as there will be an ever-growing number of pages (there are already 13):

pull_first_table <- function(x){
  x %>% 
    read_html() %>% 
    html_table() %>%
    .[[1]] %>%
    janitor::clean_names() %>% 
    rename(area = starts_with(c("uk", "devolved")),
           cases = matches(c("total", "confirmed_cases"))) %>%
    separate(cases, c("cases", NA), sep = "\\s\\(") %>%
    mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
           cases = as.numeric(gsub(",", "", cases))) %>%
    select(date, area, cases) %>%
    filter(!area %in% c("Total"))
}

summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)

Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.

The generic function seems to work OK when I supply it with a single element, e.g. mpx_gov_uk_pages[2], but I cannot get map_df to work properly even though the web scraping is producing tibbles.

All help and pointers greatly welcomed.

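We just need to pass the function itself, not a lambda expression. In ~ pull_first_table the lambda returns the function object without ever calling it, so map_df ends up handing bind_rows() a list of functions rather than data frames, which is what triggers the error.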

map_dfr(mpx_gov_uk_pages, pull_first_table)


# A tibble: 52 × 3
   date       area             cases
   <date>     <chr>            <dbl>
 1 2022-08-02 England           2638
 2 2022-08-02 Northern Ireland    24
 3 2022-08-02 Scotland            65
 4 2022-08-02 Wales               32
 5 2022-07-29 England           2436
 6 2022-07-29 Northern Ireland    19
 7 2022-07-29 Scotland            61
 8 2022-07-29 Wales               30
 9 2022-07-26 England           2325
10 2022-07-26 Northern Ireland    18
# … with 42 more rows
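As a side note, recent purrr releases (1.0.0 onwards) mark map_dfr() and map_df() as superseded in favour of map() followed by list_rbind(). A minimal sketch of that pattern, using a hypothetical stand-in function toy_table() instead of the real scraper so that it runs offline:

```r
library(purrr)

# Hypothetical stand-in for pull_first_table(): returns a small data frame
toy_table <- function(i) {
  data.frame(page = i, cases = i * 10)
}

# map() builds a list of data frames; list_rbind() stacks them row-wise
combined <- list_rbind(map(1:3, toy_table))
```

With the real function this would read list_rbind(map(mpx_gov_uk_pages, pull_first_table)).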

If we do want to use a lambda expression, the function has to be called explicitly on .x:

map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))
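
The difference between the three forms can be checked offline with a toy function. In this sketch make_row() is a hypothetical stand-in for pull_first_table(), and plain map() is used so that the returned objects can be inspected before any row-binding happens:

```r
library(purrr)

# Hypothetical stand-in for pull_first_table()
make_row <- function(i) {
  data.frame(id = i, squared = i^2)
}

# The lambda body is just the function's name, so each element of the
# result is the function object itself; nothing is ever called
not_called <- map(1:2, ~ make_row)

# Passing the function directly: map() calls it on each element
direct <- map(1:2, make_row)

# A lambda that calls the function on .x gives the same result
via_lambda <- map(1:2, ~ make_row(.x))

is.function(not_called[[1]])   # TRUE
is.data.frame(direct[[1]])     # TRUE
identical(direct, via_lambda)  # TRUE
```

A list of functions is exactly what bind_rows() cannot combine, hence the original error.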
