Home > Back-end >  flattern nested list with uneven column numbers into data frame in R
flattern nested list with uneven column numbers into data frame in R

Time:02-08

I face a challenge of binding a nested list into a dataframe for processing.

Edit: here below is the example of original nested data before I tried to flat them.

list(list(list(url = "https://lda.senate.gov/api/v1/filings/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/", 
    filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff", filing_type = "Q1", 
    filing_type_display = "1st Quarter - Report", filing_year = 2021L, 
    filing_period = "first_quarter", filing_period_display = "1st Quarter (Jan 1 - Mar 31)", 
    filing_document_url = "https://lda.senate.gov/filings/public/filing/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/print/", 
    filing_document_content_type = "text/html", income = "15000.00", 
    expenses = NULL, expenses_method = NULL, expenses_method_display = NULL, 
    posted_by_name = "Christian Smith", dt_posted = "2021-04-30T10:20:59.217000-04:00", 
    termination_date = NULL, registrant = list(id = 8214L, url = "https://lda.senate.gov/api/v1/registrants/8214/", 
        house_registrant_id = 31113L, name = "CAPSTONE NATIONAL PARTNERS", 
        description = "public affairs", address_1 = "501 Capitol Court NE", 
        address_2 = "Suite 100", address_3 = NULL, address_4 = NULL, 
        city = "Washington", state = "DC", state_display = "District of Columbia", 
        zip = "20002", country = "US", country_display = "United States of America", 
        ppb_country = "US", ppb_country_display = "United States of America", 
        contact_name = "", contact_telephone = "", dt_updated = "2022-01-13T14:47:31.828778-05:00"), 
    client = list(id = 111342L, url = "https://lda.senate.gov/api/v1/clients/111342/", 
        client_id = 303L, name = "OSHKOSH CORPORATION", general_description = "manufacturing", 
        client_government_entity = FALSE, client_self_select = NULL, 
        state = "WI", state_display = "Wisconsin", country = "US", 
        country_display = "United States of America", ppb_state = "WI", 
        ppb_state_display = "Wisconsin", ppb_country = "US", 
        ppb_country_display = "United States of America", effective_date = "2016-04-01"), 
    lobbying_activities = list(list(general_issue_code = "BUD", 
        general_issue_code_display = "Budget/Appropriations", 
        description = "FY22 Appropriations", foreign_entity_issues = "", 
        lobbyists = list(list(lobbyist = list(id = 63767L, prefix = NULL, 
            prefix_display = NULL, first_name = "WILLIAM", nickname = NULL, 
            middle_name = NULL, last_name = "STONE", suffix = NULL, 
            suffix_display = NULL), covered_position = "Chief of Staff, Dave Obey: House Appropriations Committee", 
            new = FALSE)), government_entities = list(list(id = 2L, 
            name = "HOUSE OF REPRESENTATIVES"), list(id = 1L, 
            name = "SENATE")))), conviction_disclosures = list(), 
    foreign_entities = list(), affiliated_organizations = list())), 
    list(list(url = "https://lda.senate.gov/api/v1/filings/177b995a-3be2-4127-b962-795e76974617/", 
        filing_uuid = "177b995a-3be2-4127-b962-795e76974617", 
        filing_type = "Q1", filing_type_display = "1st Quarter - Report", 
        filing_year = 2021L, filing_period = "first_quarter", 
        filing_period_display = "1st Quarter (Jan 1 - Mar 31)", 
        filing_document_url = "https://lda.senate.gov/filings/public/filing/177b995a-3be2-4127-b962-795e76974617/print/", 
        filing_document_content_type = "text/html", income = "22500.00", 
        expenses = NULL, expenses_method = NULL, expenses_method_display = NULL, 
        posted_by_name = "Doyce Boesch", dt_posted = "2021-04-30T11:22:12.233000-04:00", 
        termination_date = NULL, registrant = list(id = 400677020L, 
            url = "https://lda.senate.gov/api/v1/registrants/400677020/", 
            house_registrant_id = NULL, name = "MR. DOYCE BOESCH", 
            description = "Government Relations", address_1 = "4515 W Street NW", 
            address_2 = NULL, address_3 = NULL, address_4 = NULL, 
            city = "Washington", state = "DC", state_display = "District of Columbia", 
            zip = "20007", country = "US", country_display = "United States of America", 
            ppb_country = "US", ppb_country_display = "United States of America", 
            contact_name = "DOYCE BOESCH", contact_telephone = " 1 202-731-9995", 
            dt_updated = "2022-01-13T14:59:12.412096-05:00"), 
        client = list(id = 194057L, url = "https://lda.senate.gov/api/v1/clients/194057/", 
            client_id = 75L, name = "INVESTMENT COMPANY INSTITUTE", 
            general_description = "Stock Market and Financial Services", 
            client_government_entity = FALSE, client_self_select = FALSE, 
            state = "DC", state_display = "District of Columbia", 
            country = "US", country_display = "United States of America", 
            ppb_state = NULL, ppb_state_display = NULL, ppb_country = "US", 
            ppb_country_display = "United States of America", 
            effective_date = "2012-07-01"), lobbying_activities = list(
            list(general_issue_code = "FIN", general_issue_code_display = "Financial Institutions/Investments/Securities", 
                description = "providing awareness of members positions", 
                foreign_entity_issues = "", lobbyists = list(
                  list(lobbyist = list(id = 52828L, prefix = NULL, 
                    prefix_display = NULL, first_name = "DOYCE", 
                    nickname = NULL, middle_name = NULL, last_name = "BOESCH", 
                    suffix = NULL, suffix_display = NULL), covered_position = NULL, 
                    new = FALSE)), government_entities = list(
                  list(id = 2L, name = "HOUSE OF REPRESENTATIVES"), 
                  list(id = 1L, name = "SENATE")))), conviction_disclosures = list(), 
        foreign_entities = list(), affiliated_organizations = list())))

So taking this highly nested data, call it

my.data

then I tried to flatten it by

flat.df <- lapply(my.data, function(i) list(unlist(i, recursive = F)))

it sort of worked, but still each element in the flat.df list still have multiple sublists, for example "lobbying_activities", "lobbyists". And they are not expaned(I want the information inside).

But if I set recursive to "TRUE", then the flattened list have duplicated columns and most frustratingly I see some columns got lumped together (for example people's name got into expense column)

Ideally I want to flat such sublists in each table and make the whole thing in one table. Then join them into a dataframe by

df<- as.data.frame(do.call("rbind", flat.df))

CodePudding user response:

tibbles are a nice format, as they support nested data.frames. I would aim for a tibble with 2 rows, a wide format. In it, each nested list element would be its own data.frame, which we could manipulate later when needed. I would do something like this:

library(tidyverse)
l = unlist(l, recursive = F)
ind_to_nest <- which(map_lgl(l[[1]], is.list))
non_tbl <- map(l, ~ .x[-ind_to_nest])
tbl <- map(l, ~ .x[ind_to_nest])

df <- bind_rows(non_tbl) %>%
  mutate(n = 1:n(), .before = 1) %>%
  mutate(data =  map(tbl,  ~ map(.x, ~flatten(.x) %>% bind_cols))) %>%
  unnest_wider(data, simplify = F)

Note that this does throw a bunch of warnings. This is because of the name conflicts present within the list.

#> New names:
#> * id -> id...5
#> * id -> id...10

Can be resolved by specifying a naming policy, or by rethinking how the data is read into R to resolve naming conflicts early.

#> Outer names are only allowed for unnamed scalar atomic inputs 

This is a bit tougher to resolve, but this issue is a starting point.

For analysis some cleaning of sub-tibbles can be performed when needed, as different tasks require different shapes.

  •  Tags:  
  • Related