My ultimate goal is to read into R a series of daily .xlsx files (named "Community_Profile_Report_YYYYMMDD_Public.xlsx") from the U.S. HealthData website. My question is: how do I extract all the assetIDs and the filenames of the associated .xlsx files and get them into a list or dataframe?
Here's the URL of a sample file (one date of many I need), and what I plan to do with it after I read it in. The alphanumeric phrase after "/files/" in the directory path corresponds to the assetID.
library(openxlsx)  # for read.xlsx
library(dplyr)

# spaces in the filename are percent-encoded so the URL is valid
url <- "https://healthdata.gov/api/views/gqxm-d9w9/files/3bf3ff9c-67d3-47af-aad3-799e21de6e3d?download=true&filename=Community%20Profile%20Report%2020210903.xlsx"

df2 <- read.xlsx(url, sheet = 6) %>%
  select(1, 2, 6, 8, 16, 76, 77) %>%
  .[-1, ]  # drop the in-sheet header row
colnames(df2) <- c("county", "FIPS", "state", "pop", "cases_last7", "vaccinated", "vacc_prop")
cols.name <- c("FIPS", "pop", "cases_last7", "vaccinated", "vacc_prop")
df2[cols.name] <- sapply(df2[cols.name], as.numeric)
df2 <- df2 %>%
  filter(complete.cases(.)) %>%
  mutate(cases_last7_100k = cases_last7 / pop * 100000,
         vacc_prop = vacc_prop * 100) %>%
  filter(state != "PR" & vacc_prop < 100)
Thanks, David
CodePudding user response:
One way would be through the parse_json function from the jsonlite package.
Quick and ugly example for the first element:
require(jsonlite)
require(magrittr)
dt <- read_json(path = 'https://healthdata.gov/resource/6hii-ae4f.json')
# as an example, consider only the first item from the list
dt2 <- dt[[1]]$metadata_published %>% parse_json
# str(dt2) to see the parsed object structure.
# we want to loop through the `attachments`
dt2$attachments %>%
  lapply(., function(attachment) {
    c(filename = attachment$filename, assetId = attachment$assetId)
  }) %>%
  do.call(rbind, .) -> extracted_list
# filter for `.xlsx` entries (anchored regex, since '*.xlsx' is not a valid pattern)
subset(extracted_list, grepl('\\.xlsx$', extracted_list[, 'filename']))
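To see the extraction step in isolation, here is a self-contained sketch that runs without a network call. The JSON string below is invented, but it mimics the `attachments` structure inside `metadata_published`; only the field names `filename` and `assetId` are taken from the real metadata:

```r
# Invented metadata_published payload mimicking the real structure
meta <- '{"attachments":[
  {"filename":"Community Profile Report 20210903.xlsx",
   "assetId":"3bf3ff9c-67d3-47af-aad3-799e21de6e3d"},
  {"filename":"readme.txt","assetId":"not-an-xlsx"}
]}'

parsed <- jsonlite::parse_json(meta)

# one named character vector per attachment, then bind into a matrix
rows <- lapply(parsed$attachments,
               function(a) c(filename = a$filename, assetId = a$assetId))
extracted <- do.call(rbind, rows)

# keep only the .xlsx entries
xlsx_only <- extracted[grepl("\\.xlsx$", extracted[, "filename"]), , drop = FALSE]
```

Here `xlsx_only` keeps the first attachment and drops the `readme.txt` row.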
For the general case, we can wrap this as a function:
extract_attachments <- function(x) {
  require(jsonlite)
  require(magrittr)
  x$metadata_published %>%
    parse_json %>%
    { .$attachments } %>%
    lapply(., function(attachment) {
      c(filename = attachment$filename,
        assetId = attachment$assetId)
    }) %>%
    do.call(rbind, .) %>%
    subset(., grepl('\\.xlsx$', .[, 'filename']))
}
and loop through the whole JSON:
lapply(dt, extract_attachments) %>% do.call(rbind, .)
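Once the filenames and assetIds are extracted, the download URL for each file can be rebuilt following the pattern of the sample URL in the question. A minimal sketch, assuming the view id `gqxm-d9w9` from that sample stays constant across files (worth verifying against a few entries before looping over all of them):

```r
# Rebuild a download URL from one (assetId, filename) pair, following the
# pattern of the sample URL in the question. The view id "gqxm-d9w9" is
# taken from that sample and assumed constant (an assumption to verify).
make_url <- function(assetId, filename) {
  paste0("https://healthdata.gov/api/views/gqxm-d9w9/files/", assetId,
         "?download=true&filename=",
         # percent-encode spaces etc. so read.xlsx gets a valid URL
         utils::URLencode(filename, reserved = TRUE))
}

u <- make_url("3bf3ff9c-67d3-47af-aad3-799e21de6e3d",
              "Community Profile Report 20210903.xlsx")
```

From there, `mapply(make_url, tbl[, "assetId"], tbl[, "filename"])` over the extracted table gives the vector of URLs to feed into the `read.xlsx` pipeline from the question.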