Web-scraping using "rvest" package, how do I loop over a bunch of county FIPS codes?

I'm trying to acquire farm subsidy data from a website. I've figured out how to scrape what I'm looking for; now I want to loop over all counties in a state (CO) and collect these subsidy data by year for every county. I'm fine with either (a) writing an individual .csv for each county after the loop runs or (b) compiling everything into one data frame and saving that as a single .csv.

Below is an example of the scraping completed for just one county. Now I'd like to write a loop that goes through the FIPS codes 08003, 08005, 08007, 08009, and 08011 for now (I can extrapolate this to other counties in CO later).

# Starting with Adams County
library(rvest)
library(dplyr)
library(tidyr)


link = "https://farm.ewg.org/regionsummary.php?fips=08001"
page = read_html(link)

year = page %>% html_nodes("tr~ tr  tr td:nth-child(1)") %>% html_text()
year

subs = page %>% html_nodes("td:nth-child(3)") %>% html_text()
subs

subsidy_data <- data.frame(subs)

subs = data.frame(do.call("rbind", strsplit(as.character(subsidy_data$subs), "$", fixed = TRUE)))

sub_data <- cbind(year, subs)
sub_data <- sub_data[-c(28),]

cons_sub_rec = page %>% html_nodes("td~ td  td small:nth-child(1) em") %>% html_text()
cons_sub_rec <- cons_sub_rec[-c(28)]

dis_sub_rec = page %>% html_nodes("small:nth-child(3) em") %>% html_text()

comm_sub_rec = page %>% html_nodes("small:nth-child(5) em") %>% html_text()

ins_sub_rec = page %>% html_nodes("small:nth-child(7) em") %>% html_text()

sub_data <- cbind(year, subs, cons_sub_rec, dis_sub_rec, comm_sub_rec, ins_sub_rec)

sub_data$fips = 8001

write.csv(sub_data, "filepath/ewg_sub_8001.csv", row.names = TRUE)

Any and all suggestions are welcome!

CodePudding user response:

Your code as shown doesn't work, so I made the quickest corrections that made sense to me below.

The idea is to wrap the steps you want to repeat in a custom function whose argument is the thing you want to loop over. Then use purrr::map_dfr() to map that function over the FIPS codes and row-bind the results into one data frame.

library(rvest)

# keep the FIPS codes as zero-padded strings so the leading zero isn't dropped
fips <- c("08001", "08003", "08005", "08007", "08009", "08011")

get_fips_data <- function(x) {
  # paste0() avoids inserting a space into the URL
  url <- paste0("https://farm.ewg.org/regionsummary.php?fips=", x)
  site <- read_html(url)
  
  # pull the year and subsidy text, then split the subsidy strings on "$"
  year = site %>% html_nodes("tr~ tr  tr td:nth-child(1)") %>% html_text()

  subs = site %>% html_nodes("td:nth-child(3)") %>% html_text()

  subsidy_data <- data.frame(subs)

  subs = data.frame(do.call("rbind", strsplit(as.character(subsidy_data$subs), "$", fixed = TRUE)))

  sub_data <- cbind(year, subs)
  sub_data <- sub_data[-28,]

  cons_sub_rec = site %>% html_nodes("small:nth-child(1) em") %>% html_text()
  cons_sub_rec <- cons_sub_rec[-28]

  dis_sub_rec = site %>% html_nodes("small:nth-child(3) em") %>% html_text()

  comm_sub_rec = site %>% html_nodes("small:nth-child(5) em") %>% html_text()

  ins_sub_rec = site %>% html_nodes("small:nth-child(7) em") %>% html_text()

  cbind(year, subs, cons_sub_rec, dis_sub_rec, comm_sub_rec, ins_sub_rec)
}


fips %>% 
  purrr::set_names() %>% 
  purrr::map_dfr(get_fips_data, .id = "fips")
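
If you prefer an individual .csv per county (option (a) in the question), one minimal sketch, assuming the same fips vector and get_fips_data() function from above, is to keep the results as a named list and write each element out. The "filepath" directory below is only a placeholder, as in the original post.

# hypothetical sketch: one .csv per county, written into the placeholder directory "filepath"
county_tables <- fips %>%
  purrr::set_names() %>%
  purrr::map(get_fips_data)

purrr::iwalk(
  county_tables,
  function(df, code) write.csv(df, file.path("filepath", paste0("ewg_sub_", code, ".csv")),
                               row.names = FALSE)
)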

CodePudding user response:

You could first request the state's geo file to collect all the county codes and names for that state; a helper function handles this. A second helper function then tidies the HTML returned for each county page (where the URL is a base string joined with the county id/code) into a data frame containing the info of interest. Map that latter function with future_map_dfr from furrr to return a single combined data frame.

Notes:

The code is written with R 4.1.0 syntax (as a learning exercise for me), but I can provide a version for earlier releases.

Credit to @hrbrmstr for the approach to handling br elements.


library(rvest)
library(xml2)    # xml_find_all(), xml_add_sibling(), xml_remove()
library(tidyverse)
library(jsonlite)
library(janitor)
library(furrr)   # future_map_dfr()

# pull the county codes and names embedded in the state's map JS file
state_county_codes <- \(state_code) {
  read_html(sprintf("https://farm.ewg.org/ammap/maps/js/%sCounties.js", state_code)) |>
    html_text() |>
    stringr::str_match("(\\[.*\\])") |>
    {
      \(x) x[, 1]
    }() |>
    jsonlite::parse_json(simplifyVector = TRUE) |>
    select(-d) |>
    mutate(
      id = substr(id, 2, 6),
      webpage = paste0("https://farm.ewg.org/regionsummary.php?fips=", id)
    ) |>
    tibble()
}

county_summary <- function(county_code) {
  page <- read_html(sprintf("https://farm.ewg.org/regionsummary.php?fips=%s", county_code))

  # tag each <br> with a "#" marker, then drop the <br> nodes so multi-line cells can be split later
  xml_find_all(page, ".//br") %>% xml_add_sibling("p", "#")
  xml_find_all(page, ".//br") %>% xml_remove()

  t <- page |>
    html_element(".table") |>
    html_table()

  t <- t[-c(5)] |> clean_names()  # drop the 5th column and tidy the header names

  # split each "#"-marked cell into its four subsidy-category values
  df <- data.frame(
    id = county_code,
    year = t$year |> stringi::stri_remove_empty() |> rep(4) |>
      {
        \(x) stringr::str_replace(x, "‡", "")
      }(),
    `subsidy_category` = stringr::str_split_fixed(t$`subsidy_category`, "#", 4) |> stringi::stri_remove_empty() |> as.vector(),
    amount = stringr::str_split_fixed(t$`subsidy_category_2`, "#", 4) |> stringi::stri_remove_empty() |> as.vector(),
    number = stringr::str_split_fixed(t$`subsidy_category_3`, "#", 4) |> stringi::stri_remove_empty() |> as.vector()
  )
}

state_code <- "co"

counties <- state_county_codes(state_code)

# scrape every county page in parallel, leaving one core free
no_cores <- future::availableCores() - 1
future::plan(future::multisession, workers = no_cores)
results <- future_map_dfr(counties$id, .f = county_summary)

final <- dplyr::left_join(results, counties, by = "id") |>
  select(title, everything()) |>
  rename(county = title)
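
If the goal from the original question is a single .csv for the whole state (option (b)), a minimal sketch is to write final out directly; the file name here is only an illustrative placeholder.

# placeholder file name, not from the original post
write.csv(final, "filepath/ewg_sub_co.csv", row.names = FALSE)

# optionally return to sequential processing once the parallel scraping is done
future::plan(future::sequential)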