Home > Back-end >  Merge multiple data.tables and rename columns to reflect origin
Merge multiple data.tables and rename columns to reflect origin

Time:10-05

As a simplified example of my problem, say I have four data.tables dt1, ..., dt4, all of which have the same structure:

head(dt1)
          date          x           y
 1: 2000-10-01  0.4527087 -0.11590788
 2: 2001-10-01  0.7200252 -0.55722270
 3: 2002-10-01 -1.3804472 -1.47030087
 4: 2003-10-01 -0.1380225  2.34157766
 5: 2004-10-01 -0.9288675 -1.32993998
 6: 2005-10-01 -0.9592633  0.76316150

That is, they all have three columns called date, x and y. My desired output is a merged data.table (on date) with five columns: date, and then the x column from each individual table renamed to reflect its original data.table:

head(desired_output)
          date      x_dt1       x_dt2      x_dt3      x_dt4
 1: 2000-10-01  0.4527087 -0.11590788  1.1581946 -1.5159040
 2: 2001-10-01  0.7200252 -0.55722270 -1.6247254 -0.3325556
 3: 2002-10-01 -1.3804472 -1.47030087 -0.9766309 -0.2368857
 4: 2003-10-01 -0.1380225  2.34157766  1.1831091 -0.4399184
 5: 2004-10-01 -0.9288675 -1.32993998  0.8716144 -0.4086229
 6: 2005-10-01 -0.9592633  0.76316150 -0.8860816 -0.4299365

I assume this can be done using the suffixes argument of merge.data.table somehow. I have tried to modify mergeDTs from this answer without success as yet. A solution that successfully modifies mergeDTs (or just using a function that could be applied to a list of of several data.tables) would be excellent.

I am aware of this very slick dplyr/purrr answer but would prefer a data.table solution.


Example data

library(data.table)
dt1 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt2 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt3 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

dt4 <- data.table(date = seq(from = as.Date("2000-10-01"), to = as.Date("2010-10-01"), by = "years"),
                  x = rnorm(11),
                  y = rnorm(11))

CodePudding user response:

Here is a possible approach accumulating the merged result in a simple for loop:

library(data.table)

dt <- dt1[, .(date, x)]

for(i in 2:4) {
  dt <- merge(dt, get(paste0("dt", i))[, .(date, x)], by = "date", suffixes = c("", paste0("_dt", i)))
}

setnames(dt, old = "x", new = "x_dt1")

head(dt)
#>          date      x_dt1      x_dt2        x_dt3      x_dt4
#> 1: 2000-10-01 -1.5035218  2.0463775 -0.120544283 -0.5662290
#> 2: 2001-10-01  0.5977386 -0.1968421 -0.840102174  1.2412272
#> 3: 2002-10-01 -0.9100557 -0.1687148 -1.738526471  1.3685767
#> 4: 2003-10-01  0.7027232  0.9009135 -0.247273205  1.3135718
#> 5: 2004-10-01  0.5269265  0.6176381 -0.007662592 -0.2928206
#> 6: 2005-10-01 -0.8350406 -0.7343245 -0.643701996  2.3068948

Or alternatively, accumulating the merged result with Reduce():

Reduce(
  f = function(dt, dti) merge(dt, get(dti)[, .(date, x)], by = "date", suffixes = c("", paste0("_", dti))),
  x = paste0("dt", 1:4),
  init = dt1[, .(date, x)]
)[, x := NULL][]

Note: to get rid of the get() calls we can collect all data.tables in a list before merging or write a small function wrapper, e.g.

merge_dts <- function(...) {
  dts <- list(...)
  dt <- dts[[1]][, .(date, x)]
  for(i in seq_along(dts)[-1]) {
    dt <- merge(dt, dts[[i]][, .(date, x)], by = "date", suffixes = c("", paste0("_dt", i)))
  }
  setnames(dt, old = "x", new = "x_dt1")
  return(dt)
}

merge_dts(dt1, dt2, dt3, dt4)

CodePudding user response:

Another solution:

nms = paste0("dt", 1:4)

lapply(mget(nms), `[`, j=c("date", "x")) |> 
  Map(f=setnames, "x", paste0("x_", nms)) |> 
  Reduce(f=\(x,y) merge(x, y, by="date"))

          date      x_dt1      x_dt2      x_dt3       x_dt4
        <Date>      <num>      <num>      <num>       <num>
 1: 2000-10-01  1.5567877 -0.3579598  1.9765388 -0.76325891
 2: 2001-10-01 -0.2656003  0.6599606 -0.8085454 -0.43846379
 3: 2002-10-01 -0.3620652 -1.1952980 -0.7108491 -0.57261896
 4: 2003-10-01  2.2358663  0.3903301 -1.0370673  1.27767750
 5: 2004-10-01  0.7298361 -0.4111746 -0.5046095  0.35132983
 6: 2005-10-01 -0.2749529 -0.3080462 -1.5936609  0.49925692
 7: 2006-10-01 -0.4624282 -0.6598809 -2.0893062 -0.10963342
 8: 2007-10-01 -1.3491373  1.1325984 -0.4708666  0.15541533
 9: 2008-10-01 -1.9459169  0.3331132 -0.5091114 -0.56520321
10: 2009-10-01 -0.8994460 -0.5218689 -0.4177664  0.69639938
11: 2010-10-01  0.8663232 -1.6789679 -0.7827229 -0.05562114
  • Related