I would like to count how many rows having a larger number in a tailored subset. For example in the following dataset:
data = data.table(x=c(rep(1,8),rep(2,8)),y=c(rep(1:8,each=2)),z=c(1,2,3,4,5,4,3,2,1,2,3,4,5,4,3,4))
For each row i, I want to know, in the rows with x=x[i] and y<y[i], how many of them have z greater than z[i].
An expected output would be
c(0,0,0,0,0,0,3,4,0,0,0,0,0,0,3,1)
I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?
CodePudding user response:
We can use a non-equi join for this
library(data.table)
data[data, .N, on = .(x = x, y < y, z > z), by = .EACHI]
x y z N
<num> <int> <num> <int>
1: 1 1 1 0
2: 1 1 2 0
3: 1 2 3 0
4: 1 2 4 0
5: 1 3 5 0
6: 1 3 4 0
7: 1 4 3 3
8: 1 4 2 4
9: 2 5 1 0
10: 2 5 2 0
11: 2 6 3 0
12: 2 6 4 0
13: 2 7 5 0
14: 2 7 4 0
15: 2 8 3 3
16: 2 8 4 1