Home > Software engineering >  Non-equi join of two tables
Non-equi join of two tables

Time:11-24

I have 2 dataframes where I need to find how many times the entries in mock$num fall within the range of x-y specified by the range dataframe.

id <- c(1:9)
num <- c(99,101,199,250,999,1500,3000,4000,5000)
mock <- data.frame(id, num)

x <- c(100,100,200,1000,4000,4000)
y <- c(198,200,300,2000,5000,5000)
range <- data.frame(x,y)

The desired output is shown below

id num check
1 99   0
2 101  2
3 199  1
4 250  1
5 999  0
6 1500 1
7 3000 0
8 4000 2
9 5000 2

which can be obtained by the following code

mock$check <- mapply(
    function(x){
        count = 0
        for (i in 1:nrow(range)){
            if (x >= range$x[i] & x <= range$y[i]){
                count = count   1
            }
        }
        paste0(count)
    },
    mock$num
)

The above approach is unsuitable for large datasets due to the for loop and I was trying to use a non-equi join method (via data.table). However, I am stuck as to how to do a non-equi join between two tables (not sure how to proceed to...)

nonequi <- mock[range, on =.(num >= x, num <=y),]

Would like to seek some guidance to this problem. Appreciate your help.

CodePudding user response:

First do non-equi join then summarise for each id how many times it appears and join back to mock to find out which ids were left out. For those ids set count to zero.

res <- setDT(mock)[setDT(range), .(id), on = .(num >= x, num <= y)][, .N, by = id][mock, on = .(id)][is.na(N), N := 0][]

CodePudding user response:

not sure about the execution speed for large data arrays

id <- c(1:9)
num <- c(99,101,199,250,999,1500,3000,4000,5000)
mock <- data.frame(id, num)

x <- c(100,100,200,1000,4000,4000)
y <- c(198,200,300,2000,5000,5000)
RANGE <- data.frame(x,y)

mock$check <- sapply(mock$num, function(z) sum(z >= RANGE$x & z <= RANGE$y))

mock
#>   id  num check
#> 1  1   99     0
#> 2  2  101     2
#> 3  3  199     1
#> 4  4  250     1
#> 5  5  999     0
#> 6  6 1500     1
#> 7  7 3000     0
#> 8  8 4000     2
#> 9  9 5000     2

Created on 2021-11-24 by the reprex package (v2.0.1)

  • Related