Home > Enterprise >  data.table: is it possible to merge .SD and return a new 'sub data table' by group?
data.table: is it possible to merge .SD and return a new 'sub data table' by group?

Time:11-03

I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.

Example:

> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
   id year freq
1:  A 2010    2
2:  A 2012    1
3:  A 2013    4
4:  A 2015    3
5:  B 2006    1
6:  B 2007    3
7:  B 2010    5
8:  B 2011    7

I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:

 id year freq
  A 2010    2
  A 2011    0
  A 2012    1
  A 2013    4
  A 2014    0
  A 2015    3
  B 2006    1
  B 2007    3
  B 2008    0
  B 2009    0
  B 2010    5
  B 2011    7

I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have used a merge operation with a complete column of years for every sub dataframe by id. Is there an equivalent to this solution with data.table?

CodePudding user response:

We can't use CJ-based approaches because the missing rows need to be by-id. An alternative is:

library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
    on = .(id, year)
  ][is.na(freq), freq := 0][]
#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7

CodePudding user response:

Another solution, maybe more explicit than @r2evans'? First make a table of complete series:

years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
    id year
 1:  A 2010
 2:  A 2011
 3:  A 2012
 4:  A 2013
 5:  A 2014
 6:  A 2015
 7:  B 2006
 8:  B 2007
 9:  B 2008
10:  B 2009
11:  B 2010
12:  B 2011

then merge and replace NAs:

full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]

full
    id year freq
 1:  A 2010    2
 2:  A 2011    0
 3:  A 2012    1
 4:  A 2013    4
 5:  A 2014    0
 6:  A 2015    3
 7:  B 2006    1
 8:  B 2007    3
 9:  B 2008    0
10:  B 2009    0
11:  B 2010    5
12:  B 2011    7

CodePudding user response:

Here is another data.table way to solve your problem:

dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]

#         id  year  freq
#     <char> <int> <num>
#  1:      A  2010     2
#  2:      A  2011     0
#  3:      A  2012     1
#  4:      A  2013     4
#  5:      A  2014     0
#  6:      A  2015     3
#  7:      B  2006     1
#  8:      B  2007     3
#  9:      B  2008     0
# 10:      B  2009     0
# 11:      B  2010     5
# 12:      B  2011     7
  • Related