Replace missing values with other data based on conditions in data.table

Time:11-24

I am looking for an efficient way to replace missing data in a data.table.

When 'description' is "", I want to replace it with the non-empty 'description' that has the latest date among rows with the same id. For example, seqs=3 should change from "" to "Red", because seqs=2 has the same id, is not "", and has the latest date. Seqs=4 would just remain "" because no other row shares its id.

I am working with several million rows and am using data.table, so I would prefer not to convert to the tidyverse.

set.seed(3)

dt <- data.frame(
    seqs = 1:11,
    id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
    description = c("Red", "Red", "", "", "Blue", "Blue", "Red", "Red", "", "Red", "Blue"),
    dates = sample(2000:2020, 11, TRUE)
)

CodePudding user response:

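One approach uses last() on the non-empty descriptions of each id, after ordering by date. Taking last(description) directly would return "" whenever the latest row of an id is itself empty, so the empty values are dropped first.

library(data.table)

setDT(dt)

# Within each id (rows ordered by date), fill "" with the latest non-empty description.
# Ids with no non-empty description are left unchanged.
dt[order(id, dates),
   description := {
     filled <- description[description != ""]
     if (length(filled)) ifelse(description == "", last(filled), description) else description
   },
   by = id]

# Result with the data listed under "Data" below
dt
    seqs id description dates
 1:    1  1         Red  2004
 2:    2  1         Red  2007
 3:    3  1         Red  2009
 4:    4  2              2003
 5:    5  3        Blue  2000
 6:    6  3        Blue  2018
 7:    7  4         Red  2005
 8:    8  4         Red  2019
 9:    9  4         Red  2012
10:   10  4         Red  2006
11:   11  5        Blue  2000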

Data

dt <- structure(list(seqs = 1:11, id = c(1, 1, 1, 2, 3, 3, 4, 4, 4,
4, 5), description = c("Red", "Red", "", "", "Blue", "Blue",
"Red", "Red", "", "Red", "Blue"), dates = c(2004L, 2007L, 2009L,
2003L, 2000L, 2018L, 2005L, 2019L, 2012L, 2006L, 2000L)), class = "data.frame", row.names = c(NA,
-11L))
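
With several million rows, an update join may also be worth trying. The following is only a sketch and not part of the original answer: the names `latest` and `fill` are illustrative, and it assumes that "" is the only marker for a missing description. It builds one row per id holding the non-empty description with the latest date, then joins that back onto the table.

```r
library(data.table)

# Same data as in the "Data" section, built directly as a data.table.
dt <- data.table(
  seqs = 1:11,
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  description = c("Red", "Red", "", "", "Blue", "Blue",
                  "Red", "Red", "", "Red", "Blue"),
  dates = c(2004L, 2007L, 2009L, 2003L, 2000L, 2018L,
            2005L, 2019L, 2012L, 2006L, 2000L)
)

# One row per id: the non-empty description with the latest date.
# Ids whose descriptions are all "" do not appear here.
latest <- dt[description != ""][order(id, dates),
                                .(fill = last(description)),
                                by = id]

# Update join: for ids present in `latest`, replace "" with the fill value.
# Rows whose id has no match in `latest` are not touched and stay "".
dt[latest, on = "id",
   description := fifelse(description == "", i.fill, description)]

dt
```

Because the lookup table has one row per id, the join does a single pass over the large table instead of evaluating an expression per group, which may help when there are many small groups. Whether it is actually faster depends on the data, so it is worth timing both versions.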