Loading multiple JSON files into R (congressional dates)

I am trying to load 12,880 json files into a dataframe in R but am having some trouble. Any pointers on what I'm doing wrong would be greatly appreciated!

In short, I am trying to see the average age of US politicians over time and have downloaded the data on all Congressional politicians from the Library of Congress: https://bioguide.congress.gov/search (you can download the whole database by clicking "download" on the top right).

Once unzipped, it is 12,880 JSON files (under 70 MB).

I have been able to load in some data as lists:

library(tidyverse)
library(jsonlite)
library(magrittr)

path<-"./data/BioguideProfiles" #This is where I am storing the unzipped data
files<-dir(path,pattern="*.json") #This is the list of all the json files
head(files) #Shows me that we have the right list
length(files) #Shows we have right number in list

test<-fromJSON("./data/BioguideProfiles/A000002.json", simplifyDataFrame = TRUE, flatten = TRUE) #loads in 1 file as a list

This shows the immediate issue: R is only recognizing the JSON file as a list and it is not converting it into a dataframe. If I try to force it into a dataframe (using as.data.frame), I get:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 0, 13, 11

Using code from another StackOverflow post, I tried to load everything in at once:

data<-files %>%
  map_df(~fromJSON(file.path(path, .),flatten = TRUE))

This runs for a bit but then gives me the error:

> data<-files %>%
    map_df(~fromJSON(file.path(path, .),flatten = TRUE))
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
  1. files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
  8. vctrs::stop_incompatible_size(...)
  9. vctrs:::stop_incompatible(...)
 10. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
     ▆
  1. ├─files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
  2. ├─purrr::map_df(., ~fromJSON(file.path(path, .), flatten = TRUE))
  3. │ └─dplyr::bind_rows(res, .id = .id)
  4. │   └─dplyr:::map(...)
  5. │     └─base::lapply(.x, .f, ...)
  6. │       └─dplyr FUN(X[[i]], ...)
  7. │         └─vctrs::data_frame(!!!.x, .name_repair = "minimal")
  8. └─vctrs::stop_incompatible_size(...)
  9.   └─vctrs:::stop_incompatible(...)
 10.     └─vctrs:::stop_vctrs(...)
 11.       └─rlang::abort(message, class = c(class, "vctrs_error"), ...)

(Apologies & thanks - I am jumping back into R after a few years doing other things so am very rusty...)

Answer:

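A single JSON file is imported as a nested list whose elements include data frames with different numbers of rows. Because those elements have no common row count, the list can't be converted directly into a single data frame.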
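To make the mismatch concrete, here is a minimal sketch using a made-up record. The field names `usCongressBioId`, `birthDate` and `creativeWork` come from the question and its error message; `jobPositions` and all the values are invented for illustration and are not taken from the real files.

```r
library(jsonlite)

# Hypothetical record: two scalar fields plus two arrays of different lengths
txt <- '{
  "usCongressBioId": "X000001",
  "birthDate": "1800-01-01",
  "jobPositions": [{"title": "a"}, {"title": "b"}, {"title": "c"}],
  "creativeWork": [{"title": "p"}, {"title": "q"}]
}'

rec <- fromJSON(txt, flatten = TRUE)

# The result is a plain list; the arrays became data frames of 3 and 2 rows
sizes <- vapply(rec, NROW, integer(1))
print(sizes)

# as.data.frame() fails because the elements have no common row count
failed <- inherits(try(as.data.frame(rec), silent = TRUE), "try-error")
print(failed)

# The scalar fields alone do fit into a one-row data frame
is_scalar <- vapply(rec, function(x) is.atomic(x) && length(x) == 1L, logical(1))
flat <- as.data.frame(rec[is_scalar], stringsAsFactors = FALSE)
print(flat)
```

Keeping only the scalar fields is one possible way around the problem; the nested tables would need to be handled separately if they matter for the analysis.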

Instead, you can import all the JSON files into one list:

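library(jsonlite)
# full.names = TRUE prepends the directory, so each entry is a usable path;
# "\\.json$" is a regular expression matching names that end in .json
files = list.files("./data/BioguideProfiles", pattern = "\\.json$", full.names = TRUE)
df = lapply(files, fromJSON) # one nested list per file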

Now you have all 12,880 JSON files in a single list (despite its name, df is a list of lists, not a data frame).

To select birthDate from each element, you can use:

lapply(df, `[[`, 'birthDate')

For more on selecting elements from a list, see https://stackoverflow.com/a/13016620/12135618
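Since the goal is a table of birth dates, the extraction step above can be sketched as follows. This is only an illustration under stated assumptions: it writes two mock files to a temporary directory in place of ./data/BioguideProfiles, the helper `get_field` is a hypothetical name, and it assumes some records may lack `birthDate` entirely.

```r
library(jsonlite)

# Mock directory standing in for ./data/BioguideProfiles (hypothetical records)
path <- file.path(tempdir(), "BioguideProfiles")
dir.create(path, showWarnings = FALSE)
writeLines('{"usCongressBioId": "X000001", "birthDate": "1800-01-01", "creativeWork": [{"title": "p"}]}',
           file.path(path, "X000001.json"))
writeLines('{"usCongressBioId": "X000002", "creativeWork": []}',
           file.path(path, "X000002.json"))

files <- list.files(path, pattern = "\\.json$", full.names = TRUE)
profiles <- lapply(files, fromJSON)

# Hypothetical helper: return one field as a string, or NA if absent or not scalar
get_field <- function(x, field) {
  value <- x[[field]]
  if (is.null(value) || length(value) != 1L) NA_character_ else as.character(value)
}

# vapply() guarantees exactly one string per file, so the columns line up
births <- data.frame(
  id        = vapply(profiles, get_field, character(1), field = "usCongressBioId"),
  birthDate = vapply(profiles, get_field, character(1), field = "birthDate"),
  stringsAsFactors = FALSE
)
print(births)
```

Unlike `lapply(df, `[[`, 'birthDate')`, which returns NULL for records without the field, this keeps one row per file with NA for missing values. The dates are left as strings here because the thread does not show how the real files encode them; converting with `as.Date()` would need checking against the actual data.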