I am trying to load 12,880 json files into a dataframe in R but am having some trouble. Any pointers on what I'm doing wrong would be greatly appreciated!
In short, I am trying to see the average age of US politicians over time and have downloaded the data on all Congressional politicians from the Library of Congress: https://bioguide.congress.gov/search (you can download the whole database by clicking "download" on the top right).
Once unzipped, it is 12,880 JSON files (under 70 MB).
I have been able to load in some data as lists:
library(tidyverse)
library(jsonlite)
library(magrittr)
path<-"./data/BioguideProfiles" #This is where I am storing the unzipped data
files<-dir(path,pattern="*.json") #This is the list of all the json files
head(files) #Shows me that we have the right list
length(files) #Shows we have right number in list
test<-fromJSON("./data/BioguideProfiles/A000002.json", simplifyDataFrame = TRUE, flatten = TRUE) #loads in 1 file as a list
This shows the immediate issue: R is only recognizing the JSON file as a list and is not converting it into a dataframe. If I try to force it into a dataframe (using as.data.frame), I get:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0, 13, 11
Using code from another StackOverflow post, I tried to load everything in at once:
data<-files %>%
map_df(~fromJSON(file.path(path, .),flatten = TRUE))
This runs for a bit but then gives me the error:
> data<-files %>%
map_df(~fromJSON(file.path(path, .),flatten = TRUE))
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
1. files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
8. vctrs::stop_incompatible_size(...)
9. vctrs:::stop_incompatible(...)
10. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
▆
1. ├─files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
2. ├─purrr::map_df(., ~fromJSON(file.path(path, .), flatten = TRUE))
3. │ └─dplyr::bind_rows(res, .id = .id)
4. │ └─dplyr:::map(...)
5. │ └─base::lapply(.x, .f, ...)
6. │ └─dplyr FUN(X[[i]], ...)
7. │ └─vctrs::data_frame(!!!.x, .name_repair = "minimal")
8. └─vctrs::stop_incompatible_size(...)
9. └─vctrs:::stop_incompatible(...)
10. └─vctrs:::stop_vctrs(...)
11. └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
(Apologies & thanks - I am jumping back into R after a few years doing other things so am very rusty...)
CodePudding user response:
A single JSON file is imported as a nested list whose elements include dataframes with different numbers of rows, so it can't be converted directly into a single dataframe.
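To see why the conversion fails, it can help to check how many rows each top-level element would contribute. The sketch below uses a made-up record, not a real Bioguide file: `usCongressBioId` and `creativeWork` are taken from the error message in the question, while `hypotheticalField` and all the values are invented for illustration.

```r
library(jsonlite)

# Made-up record: one scalar field and two arrays of different lengths.
# "hypotheticalField" is NOT a real Bioguide field name.
txt <- '{
  "usCongressBioId": "X000001",
  "creativeWork": [{"title": "a"}, {"title": "b"}],
  "hypotheticalField": [{"name": "p"}, {"name": "q"}, {"name": "r"}]
}'
rec <- fromJSON(txt)

# Number of rows each top-level element would contribute to a dataframe
sizes <- vapply(
  rec,
  function(el) if (is.data.frame(el)) nrow(el) else length(el),
  integer(1)
)
print(sizes)

# as.data.frame() stops because the sizes differ; capture the message instead of failing
result <- tryCatch(as.data.frame(rec), error = function(e) conditionMessage(e))
print(result)
```

Running the same `sizes` check on one of the real files should show which elements are responsible for the mismatch reported in the question.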
Instead, you can import all the JSON files as follows:
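library(jsonlite)
# full.names = TRUE returns paths that include the directory; pattern is a regular expression, not a glob
files = list.files("./data/BioguideProfiles", pattern = "\\.json$", full.names = TRUE)
df = lapply(files, fromJSON) # one nested list per file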
Now you have all 12,880 JSON files in a single list.
If you want to select birthDate from each element, you can use
lapply(df, `[[`, 'birthDate')
Further reference in selecting elements in a list https://stackoverflow.com/a/13016620/12135618
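If the goal is a flat dataframe with one row per person, one possible approach is to pull out only the scalar fields you need and fill in NA where a field is missing. This is a minimal sketch that writes two made-up files to a temporary directory so it can run on its own. The helper `pluck_chr` and the sample values are hypothetical, and the date format of the real `birthDate` field is an assumption that should be checked before computing ages.

```r
library(jsonlite)

# Stand-in data: two tiny made-up JSON files in a temporary directory
tmp <- file.path(tempdir(), "bioguide_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(
  '{"usCongressBioId": "X000001", "birthDate": "1950-01-02", "creativeWork": [{"title": "a"}, {"title": "b"}]}',
  file.path(tmp, "X000001.json")
)
writeLines(
  '{"usCongressBioId": "X000002", "creativeWork": []}',
  file.path(tmp, "X000002.json")
)

files <- list.files(tmp, pattern = "\\.json$", full.names = TRUE)
profiles <- lapply(files, fromJSON)

# Hypothetical helper: return one string per record, or NA if the field is missing or empty
pluck_chr <- function(x, field) {
  value <- x[[field]]
  if (is.null(value) || length(value) == 0) NA_character_ else as.character(value[[1]])
}

births <- data.frame(
  id = vapply(profiles, pluck_chr, character(1), field = "usCongressBioId"),
  birthDate = vapply(profiles, pluck_chr, character(1), field = "birthDate"),
  stringsAsFactors = FALSE
)
print(births)
```

Using `vapply` with `character(1)` rather than `sapply` means the extraction stops with an error if the helper ever returns something other than a single string, instead of silently producing a list column.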