Home > Software design >  Return second max by group from data.table in R
Return second max by group from data.table in R

Time:11-13

How can I return the second max by group in R data table with the id column returned as well?

Reproducible example below:

library(data.table)
sample_dt <- data.table(
  myid=c("A","B","B","A","B","B","C","C","A","A"),
  mycol1 =c(101,-108,140,140,-150,-140,-111,101,150,140)
)

#desired output 
desired_dt <- data.table(
  myid=c("A","B","C"),
  mycol1_secondmax=c(140,-108,-111)
)

CodePudding user response:

A possible solution:

sample_dt[order(-mycol1),.SD[2],by=myid]
     myid mycol1
   <char>  <num>
1:      A    140
2:      B   -108
3:      C   -111

EDIT: comparison with frankas suggested by @r2evans:

sample_dt <- data.table(
  myid=sample(LETTERS,10^4,replace=T),
  mycol1 =sample(-100:100,10^4,replace=T)
)

microbenchmark::microbenchmark(
  sample_dt[order(-mycol1),.SD[2],by=myid],
  sample_dt[, .SD[frank(-mycol1, ties.method = "first") == 2L,], by = .(myid)],
  sample_dt[, .SD[order(-mycol1) == 2], myid]
  )

Unit: milliseconds
                                                                               expr      min       lq     mean   median       uq     max neval
                                       sample_dt[order(-mycol1), .SD[2], by = myid] 1.450001 1.850001 3.139760 3.145701 3.899852 13.6538   100
 sample_dt[, .SD[frank(-mycol1, ties.method = "first") == 2L,      ], by = .(myid)] 6.816001 7.199701 8.289604 7.638501 8.390902 22.0289   100
                                        sample_dt[, .SD[order(-mycol1) == 2], myid] 4.503801 4.824001 5.606895 5.229701 5.872300 15.9330   100

CodePudding user response:

Another data.table option

> sample_dt[, .SD[order(-mycol1) == 2], myid]
   myid mycol1
1:    A    140
2:    B   -108
3:    C   -111

CodePudding user response:

3rd option using setorder and later extract, i.e. `[` as part of a helper function:

setorder(sample_dt, myid, -mycol1)
fun <- function(x) {
  if(length(x) == 1) x else x[2]
}

Result

sample_dt[, fun(mycol1), by = myid]
#    myid   V1
# 1:    A  140
# 2:    B -108
# 3:    C -111
  • Related