Home > Mobile >  Data.table in R: Fitting a function on a different subgroup than the function's application gro
Data.table in R: Fitting a function on a different subgroup than the function's application gro

Time:08-29

In a large data.table, I would like to compute the percentile of a numeric variable with respect to a date variable, and with respect to a subgroup of a group variable.

Example: In the table below I compute each value's percentile with respect to date and all members of the group-variable:

library(data.table)

set.seed(101)
dt <- data.table(group = c(rep("A",3),rep("B",3),rep("C",3), rep("D",3)), 
date = rep(as.Date(19000:19002, origin = "1970-01-01"),4), 
value=rnorm(12))
 
dt[, Percentile := ecdf(value)(value), date]

> dt
    group       date      value Percentile
 1:     A 2022-01-08 -0.3260365       0.25
 2:     A 2022-01-09  0.5524619       1.00
 3:     A 2022-01-10 -0.6749438       0.50
 4:     B 2022-01-08  0.2143595       0.75
 5:     B 2022-01-09  0.3107692       0.50
 6:     B 2022-01-10  1.1739663       1.00
 7:     C 2022-01-08  0.6187899       1.00
 8:     C 2022-01-09 -0.1127343       0.25
 9:     C 2022-01-10  0.9170283       0.75
10:     D 2022-01-08 -0.2232594       0.50
11:     D 2022-01-09  0.5264481       0.75
12:     D 2022-01-10 -0.7948444       0.25

Is there a way to compute the percentile of each value with respect to date, but the ecdf()-function only being fitted to group members, e.g. c("A","B","C")?

The result should look like this:

> dt
   group       date      value Percentile
1:     A 2022-01-08 -0.3260365  0.3333333
2:     A 2022-01-09  0.5524619  1.0000000
3:     A 2022-01-10 -0.6749438  0.3333333
4:     B 2022-01-08  0.2143595  0.6666667
5:     B 2022-01-09  0.3107692  0.6666667
6:     B 2022-01-10  1.1739663  1.0000000
7:     C 2022-01-08  0.6187899  1.0000000
8:     C 2022-01-09 -0.1127343  0.3333333
9:     C 2022-01-10  0.9170283  0.6666667
10:    D 2022-01-08 -0.2232594  0.3333333
11:    D 2022-01-09  0.5264481  0.6666667
12:    D 2022-01-10 -0.7948444  0.0000000

Thank you!

CodePudding user response:

We could use

library(data.table)
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), date]

-output

> dt
     group       date      value Percentile
    <char>     <Date>      <num>      <num>
 1:      A 2022-01-08 -0.3260365  0.3333333
 2:      A 2022-01-09  0.5524619  1.0000000
 3:      A 2022-01-10 -0.6749438  0.3333333
 4:      B 2022-01-08  0.2143595  0.6666667
 5:      B 2022-01-09  0.3107692  0.6666667
 6:      B 2022-01-10  1.1739663  1.0000000
 7:      C 2022-01-08  0.6187899  1.0000000
 8:      C 2022-01-09 -0.1127343  0.3333333
 9:      C 2022-01-10  0.9170283  0.6666667
10:      D 2022-01-08 -0.2232594  0.3333333
11:      D 2022-01-09  0.5264481  0.6666667
12:      D 2022-01-10 -0.7948444  0.0000000
  • Related