I have a data table organized by id
and year
, with a frequency (freq
) value for every year where the frequency is at least 1. The start and end year may differ for every id.
Example:
> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
id year freq
1: A 2010 2
2: A 2012 1
3: A 2013 4
4: A 2015 3
5: B 2006 1
6: B 2007 3
7: B 2010 5
8: B 2011 7
I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:
id year freq
A 2010 2
A 2011 0
A 2012 1
A 2013 4
A 2014 0
A 2015 3
B 2006 1
B 2007 3
B 2008 0
B 2009 0
B 2010 5
B 2011 7
I'm starting with data.table and I'm interested to see if this is doable. With plyr
or dplyr
I would have used a merge operation with a complete column of years for every sub dataframe by id. Is there an equivalent to this solution with data.table?
CodePudding user response:
We can't use CJ
-based approaches because the missing rows need to be by-id
. An alternative is:
library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
on = .(id, year)
][is.na(freq), freq := 0][]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
CodePudding user response:
Another solution, maybe more explicit than @r2evans'? First make a table of complete series:
years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
id year
1: A 2010
2: A 2011
3: A 2012
4: A 2013
5: A 2014
6: A 2015
7: B 2006
8: B 2007
9: B 2008
10: B 2009
11: B 2010
12: B 2011
then merge and replace NAs:
full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]
full
id year freq
1: A 2010 2
2: A 2011 0
3: A 2012 1
4: A 2013 4
5: A 2014 0
6: A 2015 3
7: B 2006 1
8: B 2007 3
9: B 2008 0
10: B 2009 0
11: B 2010 5
12: B 2011 7
CodePudding user response:
Here is another data.table way to solve your problem:
dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7