I would like to obtain quantile in a tailored subset. For example in the following dataset:
data = data.table(x=c(rep(1,9),rep(2,9)),y=c(rep(1:6,each=3)),z=1:18)
- For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the 50%tile (as well as other quantiles in further calculations, e.g. 10%tile, 5%tile) of z[i].
An expected output would be
c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)
- For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the mean of z[i].
An expected output would be (same as 1 in this dataset, but would be different in other datasets).
c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)
I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?
CodePudding user response:
Use a non-equi join in data.table
data[data, quantile(z, 0.5), on = .(x = x, y <=y), by = .EACHI]$V1
#[1] 2.0 2.0 2.0 3.5 3.5 3.5 5.0 5.0 5.0 11.0 11.0 11.0 12.5 12.5 12.5 14.0 14.0 14.0
If we want to create a column
data[data[unique(data[, .(x, y)]), quantile(z, 0.5),
on = .(x = x, y <=y), by = .EACHI], z_mean := V1, on = .(x, y)]
-output
> data
x y z z_mean
<num> <int> <int> <num>
1: 1 1 1 2.0
2: 1 1 2 2.0
3: 1 1 3 2.0
4: 1 2 4 3.5
5: 1 2 5 3.5
6: 1 2 6 3.5
7: 1 3 7 5.0
8: 1 3 8 5.0
9: 1 3 9 5.0
10: 2 4 10 11.0
11: 2 4 11 11.0
12: 2 4 12 11.0
13: 2 5 13 12.5
14: 2 5 14 12.5
15: 2 5 15 12.5
16: 2 6 16 14.0
17: 2 6 17 14.0
18: 2 6 18 14.0