R data.table merge by grouped data.table

Time:10-31

I have a large dataset (~800M rows) as a data.table. The dataset consists of equidistant time series data for thousands of IDs. My problem is that missing values were not encoded originally; the corresponding rows are simply absent from the dataset. So I would like to add the rows with missing data. I know that the same timestamps should be present for each ID.

Given the size of the dataset, my initial idea was to create one data.table that contains every time step the data should include, and then use merge with all = TRUE for each ID of the main data.table. So far, however, I have only managed to do that when the data.table with all time steps (complete_dt) also includes the ID column. This creates a lot of redundant information, as each ID should have the same time steps.

I made an MWE. For simplicity, since my data is equidistant, I have replaced the POSIXct column with a simple integer column:

library(data.table)

# My main dataset 
set.seed(123)
main_dt <- data.table(id = as.factor(rep(1:3, c(5,4,3))), 
                   pseudo_time = c(1,3,4,6,7, 1,3,4,5, 3,5,6),
                   value = runif(12))

# Assuming that I should have the pseudo timesteps 1:7 for each ID
# Given the size of my real data I would like to create the pseudo time not for each ID but only once
complete_dt <- main_dt[, list(pseudo_time = 1:7), by = id]

# The dt I need to get in the end
result_dt <- merge.data.table(main_dt, complete_dt, all = TRUE)

I have seen this somewhat similar question, "Merge (full join) recursively one data.table with each group of another data.table", but I have not managed to apply it to my problem.

Any help with a more efficient solution than mine would be much appreciated.

CodePudding user response:

Here is an alternative, though it is probably not much more efficient:

setkey(main_dt, id, pseudo_time)
main_dt[CJ(id, pseudo_time = 1:7, unique = TRUE)]
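
The same idea can also be written without setting a key, by passing the join columns through on=. The sketch below assumes the time grid is 1:7 as in the question and keeps it once in a plain vector; the names all_steps and filled_dt are only illustrative. Note that CJ() still builds the full id x time grid temporarily, so this is not guaranteed to save memory on the real data.

```r
library(data.table)

# Same example data as in the question
set.seed(123)
main_dt <- data.table(id = as.factor(rep(1:3, c(5, 4, 3))),
                      pseudo_time = c(1, 3, 4, 6, 7, 1, 3, 4, 5, 3, 5, 6),
                      value = runif(12))

# Time steps stored once, without an id column (assumed grid, hypothetical name)
all_steps <- 1:7

# CJ() builds every (id, pseudo_time) combination on the fly;
# joining main_dt onto it leaves value = NA where a row was missing
filled_dt <- main_dt[CJ(id = id, pseudo_time = all_steps, unique = TRUE),
                     on = .(id, pseudo_time)]

# Quick sanity check: every id should now have all 7 time steps
filled_dt[, .N, by = id]
```

Because the join is specified with on=, main_dt is not reordered or keyed as a side effect, which may matter if other code relies on its current row order.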