R Obtain quantile and mean from a tailored subset in the dataset-CodePudding

I would like to obtain quantile in a tailored subset. For example in the following dataset:

data = data.table(x=c(rep(1,9),rep(2,9)),y=c(rep(1:6,each=3)),z=1:18)

For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the 50%tile (as well as other quantiles in further calculations, e.g. 10%tile, 5%tile) of z[i].

An expected output would be

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)

For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the mean of z[i].

An expected output would be (same as 1 in this dataset, but would be different in other datasets).

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)

I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?

CodePudding user response：

Use a non-equi join in data.table

data[data, quantile(z, 0.5), on = .(x = x, y <=y), by = .EACHI]$V1
#[1]  2.0  2.0  2.0  3.5  3.5  3.5  5.0  5.0  5.0 11.0 11.0 11.0 12.5 12.5 12.5 14.0 14.0 14.0

If we want to create a column

data[data[unique(data[, .(x, y)]), quantile(z, 0.5), 
  on = .(x = x, y <=y), by = .EACHI], z_mean := V1, on = .(x, y)]

-output

> data
        x     y     z z_mean
    <num> <int> <int>  <num>
 1:     1     1     1    2.0
 2:     1     1     2    2.0
 3:     1     1     3    2.0
 4:     1     2     4    3.5
 5:     1     2     5    3.5
 6:     1     2     6    3.5
 7:     1     3     7    5.0
 8:     1     3     8    5.0
 9:     1     3     9    5.0
10:     2     4    10   11.0
11:     2     4    11   11.0
12:     2     4    12   11.0
13:     2     5    13   12.5
14:     2     5    14   12.5
15:     2     5    15   12.5
16:     2     6    16   14.0
17:     2     6    17   14.0
18:     2     6    18   14.0