Home > Back-end >  r Compare a number to a tailored subset in the dataset
r Compare a number to a tailored subset in the dataset

Time:05-12

I would like to count how many rows having a larger number in a tailored subset. For example in the following dataset:

data = data.table(x=c(rep(1,8),rep(2,8)),y=c(rep(1:8,each=2)),z=c(1,2,3,4,5,4,3,2,1,2,3,4,5,4,3,4))

For each row i, I want to know, in the rows with x=x[i] and y<y[i], how many of them have z greater than z[i].

An expected output would be

c(0,0,0,0,0,0,3,4,0,0,0,0,0,0,3,1)

I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?

CodePudding user response:

We can use a non-equi join for this

library(data.table)
data[data, .N, on = .(x = x, y < y, z > z), by = .EACHI]
        x     y     z     N
    <num> <int> <num> <int>
 1:     1     1     1     0
 2:     1     1     2     0
 3:     1     2     3     0
 4:     1     2     4     0
 5:     1     3     5     0
 6:     1     3     4     0
 7:     1     4     3     3
 8:     1     4     2     4
 9:     2     5     1     0
10:     2     5     2     0
11:     2     6     3     0
12:     2     6     4     0
13:     2     7     5     0
14:     2     7     4     0
15:     2     8     3     3
16:     2     8     4     1
  • Related