Home > OS >  Select rows until last non-missing value by group
Select rows until last non-missing value by group

Time:09-10

I have the following data:

library(data.table)

D1 <- data.table(store = "store1",
                 client = "123456",
                 week = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2007, 2010, 2011,2012, 2013, 2014, 2015),
                 sells = c(3434, NA, 6566, NA, 8788, 4343, NA , NA, NA, NA, NA, NA, NA, NA, NA))

D2 <- data.table(store = "store1",
                client = "654321",
                week = c(2001,2002,2003, 2004, 2005, 2006, 2007 , 2008, 2007, 2010, 2011, 2012, 2013, 2014, 2015),
                sells = c(45455, 45454, 5454, NA, 65656, 5858, 43434, 55898, NA, NA, NA, NA, NA, NA, NA))

DT <- rbind(D1, D2)

For each store and client in DT, I want to keep rows until the last non-NA value of sells, and remove the subsequent rows.

     store client week sells
1:  store1 123456 2001  3434
2:  store1 123456 2002    NA
3:  store1 123456 2003  6566
4:  store1 123456 2004    NA
5:  store1 123456 2005  8788
6:  store1 123456 2006  4343 # last non-NA for store1, client 123456
7:  store1 123456 2007    NA
8:  store1 123456 2008    NA
9:  store1 123456 2007    NA
10: store1 123456 2010    NA
11: store1 123456 2011    NA
12: store1 123456 2012    NA
13: store1 123456 2013    NA
14: store1 123456 2014    NA
15: store1 123456 2015    NA
16: store1 654321 2001 45455
17: store1 654321 2002 45454
18: store1 654321 2003  5454
19: store1 654321 2004    NA
20: store1 654321 2005 65656
21: store1 654321 2006  5858
22: store1 654321 2007 43434
23: store1 654321 2008 55898 # last non-NA for store1, client 654321
24: store1 654321 2007    NA
25: store1 654321 2010    NA
26: store1 654321 2011    NA
27: store1 654321 2012    NA
28: store1 654321 2013    NA
29: store1 654321 2014    NA
30: store1 654321 2015    NA

Desired result:

     store client week sells
 1: store1 123456 2001  3434
 2: store1 123456 2002    NA
 3: store1 123456 2003  6566
 4: store1 123456 2004    NA
 5: store1 123456 2005  8788
 6: store1 123456 2006  4343
 7: store1 654321 2001 45455
 8: store1 654321 2002 45454
 9: store1 654321 2003  5454
10: store1 654321 2004    NA
11: store1 654321 2005 65656
12: store1 654321 2006  5858
13: store1 654321 2007 43434
14: store1 654321 2008 55898

I have tried using .SD together with which.max to get the maximum week, but I can't solve it

DT[, .SD[which.max(week)], keyby = c("store", client")]

Thank you very much for your kind answers.

CodePudding user response:

Another attempt, using the .I rows index to subset all rows with an index <= to the last non-NA:

DT[DT[, .I <= .I[last(which(!is.na(sells)))], by=.(store,client)]$V1]

CodePudding user response:

You're close, but which.max is only going to return a single row (per group). Try this:

DT[, .SD[ rev(cumsum(rev(!is.na(sells))) > 0), ], by = .(store, client) ]
#      store client  week sells
#     <char> <char> <num> <num>
#  1: store1 123456  2001  3434
#  2: store1 123456  2002    NA
#  3: store1 123456  2003  6566
#  4: store1 123456  2004    NA
#  5: store1 123456  2005  8788
#  6: store1 123456  2006  4343
#  7: store1 654321  2001 45455
#  8: store1 654321  2002 45454
#  9: store1 654321  2003  5454
# 10: store1 654321  2004    NA
# 11: store1 654321  2005 65656
# 12: store1 654321  2006  5858
# 13: store1 654321  2007 43434
# 14: store1 654321  2008 55898

The premise:

  • !is.na(sells) returns true or false;
  • when a logical is summed, false is 0 and true is 1
  • cumsum(.) > 0 is effectively a rolling "if any", meaning that as soon as one value is true, the whole conditional remains true for the rest of the values
  • because we want to go from bottom-up, however, we need to do the double-rev; if you envision going from bottom-to-top in any particular group, this means "starting from the first (bottom-most) week with a non-NA sells, keep that row and everything above it"

CodePudding user response:

You were close, but you can use .SD within .SD[] to do this:

DT[, .SD[1:which(week == .SD[!is.na(sells), max(week)])], keyby = c("store","client")]

Note: which.max doesn't work because of trying to remove the NA's after the last week with sells.

  • Related