I face a challenge of binding a nested list into a dataframe for processing.
Edit: here below is the example of original nested data before I tried to flat them.
list(list(list(url = "https://lda.senate.gov/api/v1/filings/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/",
filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff", filing_type = "Q1",
filing_type_display = "1st Quarter - Report", filing_year = 2021L,
filing_period = "first_quarter", filing_period_display = "1st Quarter (Jan 1 - Mar 31)",
filing_document_url = "https://lda.senate.gov/filings/public/filing/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/print/",
filing_document_content_type = "text/html", income = "15000.00",
expenses = NULL, expenses_method = NULL, expenses_method_display = NULL,
posted_by_name = "Christian Smith", dt_posted = "2021-04-30T10:20:59.217000-04:00",
termination_date = NULL, registrant = list(id = 8214L, url = "https://lda.senate.gov/api/v1/registrants/8214/",
house_registrant_id = 31113L, name = "CAPSTONE NATIONAL PARTNERS",
description = "public affairs", address_1 = "501 Capitol Court NE",
address_2 = "Suite 100", address_3 = NULL, address_4 = NULL,
city = "Washington", state = "DC", state_display = "District of Columbia",
zip = "20002", country = "US", country_display = "United States of America",
ppb_country = "US", ppb_country_display = "United States of America",
contact_name = "", contact_telephone = "", dt_updated = "2022-01-13T14:47:31.828778-05:00"),
client = list(id = 111342L, url = "https://lda.senate.gov/api/v1/clients/111342/",
client_id = 303L, name = "OSHKOSH CORPORATION", general_description = "manufacturing",
client_government_entity = FALSE, client_self_select = NULL,
state = "WI", state_display = "Wisconsin", country = "US",
country_display = "United States of America", ppb_state = "WI",
ppb_state_display = "Wisconsin", ppb_country = "US",
ppb_country_display = "United States of America", effective_date = "2016-04-01"),
lobbying_activities = list(list(general_issue_code = "BUD",
general_issue_code_display = "Budget/Appropriations",
description = "FY22 Appropriations", foreign_entity_issues = "",
lobbyists = list(list(lobbyist = list(id = 63767L, prefix = NULL,
prefix_display = NULL, first_name = "WILLIAM", nickname = NULL,
middle_name = NULL, last_name = "STONE", suffix = NULL,
suffix_display = NULL), covered_position = "Chief of Staff, Dave Obey: House Appropriations Committee",
new = FALSE)), government_entities = list(list(id = 2L,
name = "HOUSE OF REPRESENTATIVES"), list(id = 1L,
name = "SENATE")))), conviction_disclosures = list(),
foreign_entities = list(), affiliated_organizations = list())),
list(list(url = "https://lda.senate.gov/api/v1/filings/177b995a-3be2-4127-b962-795e76974617/",
filing_uuid = "177b995a-3be2-4127-b962-795e76974617",
filing_type = "Q1", filing_type_display = "1st Quarter - Report",
filing_year = 2021L, filing_period = "first_quarter",
filing_period_display = "1st Quarter (Jan 1 - Mar 31)",
filing_document_url = "https://lda.senate.gov/filings/public/filing/177b995a-3be2-4127-b962-795e76974617/print/",
filing_document_content_type = "text/html", income = "22500.00",
expenses = NULL, expenses_method = NULL, expenses_method_display = NULL,
posted_by_name = "Doyce Boesch", dt_posted = "2021-04-30T11:22:12.233000-04:00",
termination_date = NULL, registrant = list(id = 400677020L,
url = "https://lda.senate.gov/api/v1/registrants/400677020/",
house_registrant_id = NULL, name = "MR. DOYCE BOESCH",
description = "Government Relations", address_1 = "4515 W Street NW",
address_2 = NULL, address_3 = NULL, address_4 = NULL,
city = "Washington", state = "DC", state_display = "District of Columbia",
zip = "20007", country = "US", country_display = "United States of America",
ppb_country = "US", ppb_country_display = "United States of America",
contact_name = "DOYCE BOESCH", contact_telephone = " 1 202-731-9995",
dt_updated = "2022-01-13T14:59:12.412096-05:00"),
client = list(id = 194057L, url = "https://lda.senate.gov/api/v1/clients/194057/",
client_id = 75L, name = "INVESTMENT COMPANY INSTITUTE",
general_description = "Stock Market and Financial Services",
client_government_entity = FALSE, client_self_select = FALSE,
state = "DC", state_display = "District of Columbia",
country = "US", country_display = "United States of America",
ppb_state = NULL, ppb_state_display = NULL, ppb_country = "US",
ppb_country_display = "United States of America",
effective_date = "2012-07-01"), lobbying_activities = list(
list(general_issue_code = "FIN", general_issue_code_display = "Financial Institutions/Investments/Securities",
description = "providing awareness of members positions",
foreign_entity_issues = "", lobbyists = list(
list(lobbyist = list(id = 52828L, prefix = NULL,
prefix_display = NULL, first_name = "DOYCE",
nickname = NULL, middle_name = NULL, last_name = "BOESCH",
suffix = NULL, suffix_display = NULL), covered_position = NULL,
new = FALSE)), government_entities = list(
list(id = 2L, name = "HOUSE OF REPRESENTATIVES"),
list(id = 1L, name = "SENATE")))), conviction_disclosures = list(),
foreign_entities = list(), affiliated_organizations = list())))
So taking this highly nested data, call it
my.data
then I tried to flatten it by
flat.df <- lapply(my.data, function(i) list(unlist(i, recursive = F)))
it sort of worked, but still each element in the flat.df list still have multiple sublists, for example "lobbying_activities", "lobbyists". And they are not expaned(I want the information inside).
But if I set recursive to "TRUE", then the flattened list have duplicated columns and most frustratingly I see some columns got lumped together (for example people's name got into expense column)
Ideally I want to flat such sublists in each table and make the whole thing in one table. Then join them into a dataframe by
df<- as.data.frame(do.call("rbind", flat.df))
CodePudding user response:
tibble
s are a nice format, as they support nested data.frames. I would aim for a tibble with 2 rows, a wide format. In it, each nested list element would be its own data.frame, which we could manipulate later when needed. I would do something like this:
library(tidyverse)
l = unlist(l, recursive = F)
ind_to_nest <- which(map_lgl(l[[1]], is.list))
non_tbl <- map(l, ~ .x[-ind_to_nest])
tbl <- map(l, ~ .x[ind_to_nest])
df <- bind_rows(non_tbl) %>%
mutate(n = 1:n(), .before = 1) %>%
mutate(data = map(tbl, ~ map(.x, ~flatten(.x) %>% bind_cols))) %>%
unnest_wider(data, simplify = F)
Note that this does throw a bunch of warnings. This is because of the name conflicts present within the list.
#> New names:
#> * id -> id...5
#> * id -> id...10
Can be resolved by specifying a naming policy, or by rethinking how the data is read into R to resolve naming conflicts early.
#> Outer names are only allowed for unnamed scalar atomic inputs
This is a bit tougher to resolve, but this issue is a starting point.
For analysis some cleaning of sub-tibbles can be performed when needed, as different tasks require different shapes.