Combining list of data.tables based on data.table name

Time:01-04

This is an extension of a question I posted a number of years ago.

In summary, I have a list containing a number of data.tables that isn't known in advance, some of which share the same name. I want to combine all data.tables with the same name using rbindlist, filling any columns that are missing from a given table with NA, i.e. using the fill = TRUE argument of rbindlist.

I then want to return a list of the combined data.tables, named after their original names.

I'll give an example:

library(data.table)

# Create data for list A

A_1 <- data.table(A = 1:2,B = 2:3)
A_2 <- data.table(C = 100:102,D = 300:302,E = 1:3)
A <- list(AA = A_1,BB = A_2)

# Create data for list B

B_1 <- data.table(A = 2:4,B = 1:3,F = 4:6)
B_2 <- data.table(C = 10:12,D = 20:22,G = 1:3,I = 10:12)
B <- list(AA = B_1,BB = B_2)

# Create data for list C

C_1 <- data.table(I = 1:2,J = 3:4)
C_2 <- data.table(C = 1:3,D = 4:6)
C <- list(AA = C_1,BB = C_2)

# Create list DT which is a combination of lists A, B and C

DT <- c(A,B,C)

I can combine them easily enough using the following code.

library(purrr)

pmap(.l = list(y = unique(names(DT))),
     .f = function(y) rbindlist(DT[names(DT) %in% y], fill = TRUE)) %>%
  set_names(unique(names(DT)))

However, in practice DT is produced by another function, and I'd strongly prefer to chain (%>%) that function's output straight into the pmap approach above, without first creating DT as a separate object.

I've tried the following:

# As a substitute for the function code creating DT, I'll simply put DT itself below as 
# a representation of the output

DT %>% pmap(.l = list(y = unique(names(.))),
     .f = function(y) rbindlist(.[names(.) %in% y],fill = T))

This doesn't work and gives an error about unused arguments to the function. Do you know why it doesn't work?

Note that I haven't used set_names to name the output of pmap here, because the unique names of DT are no longer available once the pmap output has been created. I'm not sure how to do that within a chain (or whether it's possible without carrying the names along in the data.table output itself).

Let me know if this is unclear and I can try to explain it another way.

Any help would be greatly appreciated!

Thanks

Phil

Answer:

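We could wrap the call in braces ({}). Inside a braced block, %>% does not also insert DT as the first argument of pmap, so . is used only where it is written explicitly.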

library(purrr) # version 1.0.0
library(dplyr)
DT %>% {
  # Store the unique names once so they can be reused to name the result
  unq <- unique(names(.))
  pmap(.l = list(y = unq),
       .f = function(y) rbindlist(.[names(.) %in% y], fill = TRUE)) %>%
    setNames(unq)
}
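
As for why the original attempt fails: when . appears only inside nested calls (such as names(.)) and not as a direct argument, %>% still passes the left-hand side as the first argument. So DT most likely ends up as an extra argument that pmap forwards to .f, which would explain the unused-argument error. A minimal sketch of that rule, using a hypothetical helper count_args that is not part of any package:

```r
library(magrittr)

# Hypothetical helper for illustration: reports how many arguments it received.
count_args <- function(...) length(list(...))

# `.` appears only inside a nested call, so the left-hand side is ALSO
# inserted as the first argument: count_args(x, length(x)).
n_plain <- c(10, 20, 30) %>% count_args(length(.))

# Inside braces nothing is inserted; `.` is used only where it is written.
n_braced <- c(10, 20, 30) %>% { count_args(length(.)) }

n_plain   # 2
n_braced  # 1
```

The braces are what stop the extra argument from being added, which is why the block above works while the unbraced pmap call does not.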

Alternatively, instead of looping over the names and binding each group, split the list by name, which returns a named list, and then row-bind each element:

DT %>% 
  split(names(.)) %>% 
  map(list_rbind)

Output:

$AA
    A  B  F  I  J
1:  1  2 NA NA NA
2:  2  3 NA NA NA
3:  2  1  4 NA NA
4:  3  2  5 NA NA
5:  4  3  6 NA NA
6: NA NA NA  1  3
7: NA NA NA  2  4

$BB
     C   D  E  G  I
1: 100 300  1 NA NA
2: 101 301  2 NA NA
3: 102 302  3 NA NA
4:  10  20 NA  1 10
5:  11  21 NA  2 11
6:  12  22 NA  3 12
7:   1   4 NA NA NA
8:   2   5 NA NA NA
9:   3   6 NA NA NA
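
Since the question asks specifically for rbindlist with fill = TRUE, the same split idea can probably be written with data.table's own function in place of list_rbind. The sketch below uses a small made-up list as a stand-in for the real upstream output, so the table contents are assumptions for illustration only:

```r
library(data.table)
library(magrittr)

# Illustrative stand-in for the list produced upstream (note the duplicate names)
DT <- list(
  AA = data.table(A = 1:2, B = 2:3),
  BB = data.table(C = 100:102, D = 300:302),
  AA = data.table(A = 2:4, F = 4:6),
  BB = data.table(C = 1:3, G = 7:9)
)

# split() groups the tables by name and returns a list named by the unique
# names; rbindlist(fill = TRUE) then pads columns missing from a table with NA.
combined <- DT %>%
  split(names(.)) %>%
  lapply(rbindlist, fill = TRUE)

names(combined)
```

Because split already names its output by group, no separate set_names step is needed, and the whole thing stays in one chain.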