My ultimate goal is to read into R a series of daily .xlsx files (named "Community_Profile_Report_YYYYMMDD_Public.xlsx") from the U.S. HealthData website. My question is: how do I extract all the assetIDs and the filenames of the associated .xlsx files and get them into a list or dataframe?
Here's the URL of a sample file (one date of many I need), and what I plan to do with it after I read it in. The alphanumeric phrase after "/files/" in the directory path corresponds to the assetID.
library(openxlsx)  # for read.xlsx
library(dplyr)

# spaces in the filename are percent-encoded so the URL is valid
url <- "https://healthdata.gov/api/views/gqxm-d9w9/files/3bf3ff9c-67d3-47af-aad3-799e21de6e3d?download=true&filename=Community%20Profile%20Report%2020210903.xlsx"

df2 <- read.xlsx(url, sheet = 6) %>%
  select(1, 2, 6, 8, 16, 76, 77) %>%
  .[-1, ]  # drop the in-sheet header row
colnames(df2) <- c("county", "FIPS", "state", "pop", "cases_last7", "vaccinated", "vacc_prop")
cols.name <- c("FIPS", "pop", "cases_last7", "vaccinated", "vacc_prop")
df2[cols.name] <- sapply(df2[cols.name], as.numeric)
df2 <- df2 %>%
  filter(complete.cases(.)) %>%
  mutate(cases_last7_100k = cases_last7 / pop * 100000,
         vacc_prop = vacc_prop * 100) %>%
  filter(state != "PR" & vacc_prop < 100)
Thanks, David
CodePudding user response:
One way would be through the parse_json function from the jsonlite package.
Quick and ugly example for the first element:
require(jsonlite)
require(magrittr)
dt <- read_json(path = 'https://healthdata.gov/resource/6hii-ae4f.json')
# as an example, consider only the first item from the list
dt2 <- dt[[1]]$metadata_published %>% parse_json
# str(dt2) to see the parsed object structure.
# we want to loop through the `attachments`
dt2$attachments %>%
  lapply(., function(attachment) {
    c(filename = attachment$filename, assetId = attachment$assetId)
  }) %>%
  do.call(rbind, .) -> extracted_list
# filter for `.xlsx` entries (anchored regex, since '*.xlsx' is not a valid pattern)
subset(extracted_list, grepl('\\.xlsx$', extracted_list[, 'filename']))
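To see the extraction step in isolation, here is a self-contained sketch that runs without a network call. The JSON string below is invented, but it mimics the `attachments` structure inside `metadata_published`; only the field names `filename` and `assetId` are taken from the real metadata:

```r
# Invented metadata_published payload mimicking the real structure
meta <- '{"attachments":[
  {"filename":"Community Profile Report 20210903.xlsx",
   "assetId":"3bf3ff9c-67d3-47af-aad3-799e21de6e3d"},
  {"filename":"readme.txt","assetId":"not-an-xlsx"}
]}'

parsed <- jsonlite::parse_json(meta)

# one named character vector per attachment, then bind into a matrix
rows <- lapply(parsed$attachments,
               function(a) c(filename = a$filename, assetId = a$assetId))
extracted <- do.call(rbind, rows)

# keep only the .xlsx entries
xlsx_only <- extracted[grepl("\\.xlsx$", extracted[, "filename"]), , drop = FALSE]
```

Here `xlsx_only` keeps the first attachment and drops the `readme.txt` row.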
For the general case, we can wrap this as a function:
extract_attachments <- function(x) {
  require(jsonlite)
  require(magrittr)
  x$metadata_published %>%
    parse_json %>%
    { .$attachments } %>%
    lapply(., function(attachment) {
      c(filename = attachment$filename,
        assetId = attachment$assetId)
    }) %>%
    do.call(rbind, .) %>%
    subset(., grepl('\\.xlsx$', .[, 'filename']))
}
and loop through the whole JSON:
lapply(dt, extract_attachments) %>% do.call(rbind, .)
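Once the filenames and assetIds are extracted, the download URL for each file can be rebuilt following the pattern of the sample URL in the question. A minimal sketch, assuming the view id `gqxm-d9w9` from that sample stays constant across files (worth verifying against a few entries before looping over all of them):

```r
# Rebuild a download URL from one (assetId, filename) pair, following the
# pattern of the sample URL in the question. The view id "gqxm-d9w9" is
# taken from that sample and assumed constant (an assumption to verify).
make_url <- function(assetId, filename) {
  paste0("https://healthdata.gov/api/views/gqxm-d9w9/files/", assetId,
         "?download=true&filename=",
         # percent-encode spaces etc. so read.xlsx gets a valid URL
         utils::URLencode(filename, reserved = TRUE))
}

u <- make_url("3bf3ff9c-67d3-47af-aad3-799e21de6e3d",
              "Community Profile Report 20210903.xlsx")
```

From there, `mapply(make_url, tbl[, "assetId"], tbl[, "filename"])` over the extracted table gives the vector of URLs to feed into the `read.xlsx` pipeline from the question.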