Home > Software design >  Is there a way to keep the maximum value (by subset) of a column and remove those NAs using .SD and
Is there a way to keep the maximum value (by subset) of a column and remove those NAs using .SD and

Time:09-09

thanks for reading and for your help. I am trying to compose the information of a data.table. and it seems that I have not been successful, and little by little I have run out of ideas. I have used the data.table package to try to fix the problem, but have not had very good results.

I have the following datatables:

D1<-data.table(store = c("store1"),
           client = c("123456"),
           week = c(2001,2002,2003, 2004, 2005, 2006, 2007 , 2008, 2007, 2010, 2011,2012, 2013, 2014,2015),
           sells = c(3434,NA,6566, NA, 8788,4343,NA , NA, NA, NA, NA,NA, NA, NA,NA))

D2<-data.table(store = c("store1"),
           client = c("654321"),
           week = c(2001,2002,2003, 2004, 2005, 2006, 2007 , 2008, 2007, 2010, 2011,2012, 2013, 2014,2015),
           sells = c(45455,45454,5454, NA, 65656,5858,43434 , 55898, NA, NA, NA,NA, NA, NA,NA))

DT<-rbind(D1,D2)

my DT is:

    store client week sells
 1: store1 123456 2001  3434
 2: store1 123456 2002    NA
 3: store1 123456 2003  6566
 4: store1 123456 2004    NA
 5: store1 123456 2005  8788
 6: store1 123456 2006  4343
 7: store1 123456 2007    NA
 8: store1 123456 2008    NA
 9: store1 123456 2007    NA
10: store1 123456 2010    NA
11: store1 123456 2011    NA
12: store1 123456 2012    NA
13: store1 123456 2013    NA
14: store1 123456 2014    NA
15: store1 123456 2015    NA
16: store1 654321 2001 45455
17: store1 654321 2002 45454
18: store1 654321 2003  5454
19: store1 654321 2004    NA
20: store1 654321 2005 65656
21: store1 654321 2006  5858
22: store1 654321 2007 43434
23: store1 654321 2008 55898
24: store1 654321 2007    NA
25: store1 654321 2010    NA
26: store1 654321 2011    NA
27: store1 654321 2012    NA
28: store1 654321 2013    NA
29: store1 654321 2014    NA
30: store1 654321 2015    NA

I need to keep the last week of each store and client, and remove the NAs after the maximum week

     store client week sells
 1: store1 123456 2001  3434
 2: store1 123456 2002    NA
 3: store1 123456 2003  6566
 4: store1 123456 2004    NA
 5: store1 123456 2005  8788
 6: store1 123456 2006  4343
 7: store1 654321 2001 45455
 8: store1 654321 2002 45454
 9: store1 654321 2003  5454
10: store1 654321 2004    NA
11: store1 654321 2005 65656
12: store1 654321 2006  5858
13: store1 654321 2007 43434
14: store1 654321 2008 55898

I have tried to use the .SD function together with which.max() to get the maximums, but I can't solve it

max<-DT[, .SD[which.max(week)], keyby = c("store","client")]

Thank you very much for your kind answers.

CodePudding user response:

You're close, but which.max is only going to return a single row (per group). Try this:

DT[, .SD[ rev(cumsum(rev(!is.na(sells))) > 0), ], by = .(store, client) ]
#      store client  week sells
#     <char> <char> <num> <num>
#  1: store1 123456  2001  3434
#  2: store1 123456  2002    NA
#  3: store1 123456  2003  6566
#  4: store1 123456  2004    NA
#  5: store1 123456  2005  8788
#  6: store1 123456  2006  4343
#  7: store1 654321  2001 45455
#  8: store1 654321  2002 45454
#  9: store1 654321  2003  5454
# 10: store1 654321  2004    NA
# 11: store1 654321  2005 65656
# 12: store1 654321  2006  5858
# 13: store1 654321  2007 43434
# 14: store1 654321  2008 55898

The premise:

  • !is.na(sells) returns true or false;
  • when a logical is summed, false is 0 and true is 1
  • cumsum(.) > 0 is effectively a rolling "if any", meaning that as soon as one value is true, the whole conditional remains true for the rest of the values
  • because we want to go from bottom-up, however, we need to do the double-rev; if you envision going from bottom-to-top in any particular group, this means "starting from the first (bottom-most) week with a non-NA sells, keep that row and everything above it"

CodePudding user response:

Another attempt, using the .I rows index to subset all rows with an index <= to the last non-NA:

DT[DT[, .I <= .I[last(which(!is.na(sells)))], by=.(store,client)]$V1]

CodePudding user response:

You were close, but you can use .SD within .SD[] to do this:

DT[, .SD[1:which(week == .SD[!is.na(sells), max(week)])], keyby = c("store","client")]

Note: which.max doesn't work because of trying to remove the NA's after the last week with sells.

  • Related