Convert nested list with different names to data.frame filling NA and adding column

Time:05-11

I need a base R solution to convert a nested list whose sublists have different names into a data.frame:

mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z=list('k')))

convert(mylist)
## returns a data.frame:
##
##     a     b    z           
##     1     2    <NULL>   
##     3    NA    <NULL>   
##    NA     5    <NULL>   
##     9    NA    <chr [1]>

I know this could be done easily with dplyr::bind_rows, but I need a solution in base R. To simplify the problem, a solution that only handles a 2-level nested list with no 3rd-level lists is also fine, such as

mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

convert(mylist)
## returns a data.frame:
##
##     a     b    z           
##     1     2    NA   
##     3    NA    NA   
##    NA     5    NA   
##     9    NA    k  

I have tried something like

convert <- function(L) as.data.frame(do.call(rbind, L))

This neither fills the gaps with NA nor adds the extra column z, because rbind combines the sublists by position rather than by name.

Update

mylist here is just a simplified example. In reality I cannot assume the names of the sublist elements (a, b and z in the example), nor the sublist lengths (2, 1, 1, 2 in the example).

Here are the assumptions for expected data.frame and the input mylist:

  1. The column number of the expected data.frame is determined by the maximum length of the sublists, which could vary from 1 to several hundred. There is no explicit source of information about the length of each sublist (which names will appear or disappear in which sublist is unknown):
     max(sapply(mylist, length)) <= 1000 ## ==> TRUE
  2. The row number of the expected data.frame is determined by the length of mylist, which could vary from 1 to several thousand:
     dplyr::between(length(mylist), 0, 10000) ## ==> TRUE
  3. There is no explicit information about the names of the sublist elements or their order, so the column names and order of the expected data.frame can only be determined intrinsically from mylist.
  4. Each sublist contains elements of type numeric, character or list. To simplify the problem, consider only numeric and character.
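
As a small illustration of assumption 3, the column names and their order can be read off mylist itself. This is only a sketch on the example data; col_names, n_rows and n_cols are illustrative names that do not appear in the question. Note that in this example the number of distinct names is larger than the length of the longest sublist, so the column count is better taken from the names than from the sublist lengths.

```r
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

# all names that occur in any sublist, in order of first appearance
col_names <- unique(unlist(lapply(mylist, names)))

n_rows <- length(mylist)      # one row per sublist
n_cols <- length(col_names)   # one column per distinct name

longest <- max(lengths(mylist))  # length of the longest sublist, for comparison
```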

CodePudding user response:

You can do something like the following:

mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

convert <- function(mylist){
  # collect every name used in any sublist, in order of first appearance
  col_names <- unique(unlist(lapply(mylist, names)))
  # start from an all-NA data.frame with one row per sublist
  df <- data.frame(matrix(ncol=length(col_names),
                          nrow=length(mylist)))
  colnames(df) <- col_names
  
  # copy each sublist's values into its row, matching columns by name
  # (seq_along() and looping over names also cope with empty sublists)
  for(i in seq_along(mylist)){
    for(nm in names(mylist[[i]])){
      df[i, nm] <- mylist[[i]][[nm]]
    }
  }
  return(df)
}

df <- convert(mylist)
> df
   a  b    z
1  1  2 <NA>
2  3 NA <NA>
3 NA  5 <NA>
4  9 NA    k
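
If the cell-by-cell loop turns out to be slow for thousands of rows, one alternative is to build each column in a single pass. The following is a minimal sketch, not a tested drop-in replacement: convert_cols is a hypothetical name, and it assumes every value is a length-1 numeric or character (the simplified case from the question) and that the names are valid R column names.

```r
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

convert_cols <- function(L) {
  # every name used in any sublist, in order of first appearance
  col_names <- unique(unlist(lapply(L, names)))

  cols <- lapply(col_names, function(nm) {
    # take field `nm` from each sublist, or NA where it is absent
    vals <- lapply(L, function(x) if (is.null(x[[nm]])) NA else x[[nm]])
    unlist(vals)  # assumes scalar values, so this has one entry per sublist
  })
  names(cols) <- col_names

  as.data.frame(cols, stringsAsFactors = FALSE)
}

df <- convert_cols(mylist)
```

Because each column is assembled as a plain vector before the data.frame is created, the type of a column (numeric or character) is decided by unlist() rather than by repeated assignment into the data.frame.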

CodePudding user response:

A shorter solution in base R would be

make_df <- function(a = NA, b = NA, z = NA) {
  data.frame(a = unlist(a), b = unlist(b), z = unlist(z))
}

do.call(rbind, lapply(mylist, function(x) do.call(make_df, x)))
#>    a  b    z
#> 1  1  2 <NA>
#> 2  3 NA <NA>
#> 3 NA  5 <NA>
#> 4  9 NA    k

Update

A more general solution that uses the same method but does not require specific names would be:

build_data_frame <- function(obj) {
  nms     <- unique(unlist(sapply(obj, names)))
  frmls   <- as.list(setNames(rep(NA, length(nms)), nms))
  dflst   <- setNames(lapply(nms, function(x) call("unlist", as.symbol(x))), nms)
  make_df <- as.function(c(frmls, call("do.call", "data.frame", dflst)))
  
  # iterate over the function argument, not the global mylist
  do.call(rbind, lapply(obj, function(x) do.call(make_df, x)))
}

This allows

build_data_frame(mylist)
#>    a  b    z
#> 1  1  2 <NA>
#> 2  3 NA <NA>
#> 3 NA  5 <NA>
#> 4  9 NA    k
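
For readers who find the programmatically generated function hard to follow, the same row-wise idea can be sketched without building a function at run time: pad each sublist with NA for the names it lacks, turn it into a one-row data.frame, and rbind the rows. build_rows is a hypothetical name, and the sketch assumes length-1 numeric or character values as in the simplified example.

```r
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

build_rows <- function(obj) {
  # every name used in any sublist, in order of first appearance
  nms <- unique(unlist(lapply(obj, names)))

  rows <- lapply(obj, function(x) {
    x[setdiff(nms, names(x))] <- NA                   # add missing fields as NA
    as.data.frame(x[nms], stringsAsFactors = FALSE)   # one row, fixed column order
  })

  do.call(rbind, rows)
}

res <- build_rows(mylist)
```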

CodePudding user response:

I've got a solution. Note that the only non-base dependency is the magrittr pipe (%>%), which could be swapped for the native pipe with some adjustments.

mylist %>% 
  #' first, ensure that the 2nd level is flat,
  lapply(. %>% lapply(FUN = unlist, recursive = FALSE)) %>%
  #' replace missing vars with `NA`
  lapply(function(x, vars) {
    x[vars[!vars %in% names(x)]]<-NA
    #' put every row in the same column order, since rbind matches by position
    x[vars]
  }, vars = {.} %>% unlist() %>% names() %>% unique()) %>%
  do.call(what = rbind) %>%
  #' do nothing
  identity()
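
The pipeline above ends in do.call(rbind), which on a list of lists returns a list-mode matrix rather than a data.frame. As a hedged sketch of one way to finish the job in base R without any pipe, the matrix can be converted and its list columns flattened. The names vars, padded, m and df are illustrative, and the flattening step assumes every cell holds a single value.

```r
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))

# same padding idea as in the pipeline, written without a pipe
vars <- unique(unlist(lapply(mylist, names)))
padded <- lapply(mylist, function(x) {
  x[setdiff(vars, names(x))] <- NA
  x[vars]                      # same column order in every row
})

m  <- do.call(rbind, padded)   # list-mode matrix, one row per sublist
df <- as.data.frame(m)         # data.frame whose columns are still lists
df[] <- lapply(df, unlist)     # flatten each column to an atomic vector
```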