I am looking for an efficient way to replace missing data in data.table.
When 'description' is "", I want to replace it with the 'description' that has the latest date among the rows with the same id. So for seqs=3, I want to replace "" with "Red" because seqs=2 is not "" and has the latest date. seqs=4 would just remain "" because there is no other row with that id.
I am working with several million rows and am using data.table (so I would love to not convert to the tidyverse).
set.seed(3)
dt <- data.frame(
  seqs = 1:11,
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  description = c("Red", "Red", "", "", "Blue", "Blue", "Red", "Red", "", "Red", "Blue"),
  dates = sample(2000:2020, 11, TRUE)
)
CodePudding user response:
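One approach uses last(): visit the rows in order of id and dates, and within each id replace "" with the last description, i.e. the one with the latest date. This relies on the latest-dated row of each id not being blank itself.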
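library(data.table)
setDT(dt)  # convert the data.frame to a data.table by reference
# Within each id, with rows taken in date order, fill "" with the last (latest-dated) description
dt[order(id, dates),
   description := ifelse(description == "", last(description), description), by = id]
dt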
    seqs id description dates
 1:    1  1         Red  2017
 2:    2  1         Red  2015
 3:    3  1         Red  2011
 4:    4  2              2007
 5:    5  3        Blue  2018
 6:    6  3        Blue  2005
 7:    7  4         Red  2007
 8:    8  4         Red  2014
 9:    9  4         Red  2009
10:   10  4         Red  2010
11:   11  5        Blue  2002
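If the blank row can itself carry the latest date within an id, last() would return "" and the row would stay blank. With the dates in the Data block below, that is the case for id 1 (seqs=3 has 2009). A possible variant, sketched here as an assumption rather than as part of the answer above, first looks up the latest non-empty description per id and then fills the blanks with an update join. The names latest and fill_value are made up for illustration.

```r
library(data.table)

# Same rows as the Data block in this thread, built directly as a data.table.
dt <- data.table(
  seqs = 1:11,
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  description = c("Red", "Red", "", "", "Blue", "Blue", "Red", "Red", "", "Red", "Blue"),
  dates = c(2004L, 2007L, 2009L, 2003L, 2000L, 2018L, 2005L, 2019L, 2012L, 2006L, 2000L)
)

# Per id, keep the non-empty description that has the latest date.
# Ids whose rows are all "" (here id 2) do not appear in this lookup table.
latest <- dt[description != "",
             .(fill_value = description[which.max(dates)]),
             by = id]

# Update join: rows of dt whose id is found in `latest` are rewritten by reference;
# only blank descriptions are changed, everything else is kept as is.
dt[latest, on = "id",
   description := fifelse(description == "", i.fill_value, description)]

dt
```

Because ids without any non-empty description never match in the join, seqs=4 keeps "" as requested. The lookup table has one row per id, so the join stays cheap even with several million rows.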
Data
dt <- structure(list(seqs = 1:11, id = c(1, 1, 1, 2, 3, 3, 4, 4, 4,
4, 5), description = c("Red", "Red", "", "", "Blue", "Blue",
"Red", "Red", "", "Red", "Blue"), dates = c(2004L, 2007L, 2009L,
2003L, 2000L, 2018L, 2005L, 2019L, 2012L, 2006L, 2000L)), class = "data.frame", row.names = c(NA,
-11L))