I have 2 dataframes where I need to find how many times the entries in mock$num fall within the range of x-y specified by the range dataframe.
id <- c(1:9)
num <- c(99,101,199,250,999,1500,3000,4000,5000)
mock <- data.frame(id, num)
x <- c(100,100,200,1000,4000,4000)
y <- c(198,200,300,2000,5000,5000)
range <- data.frame(x,y)
The desired output is shown below
id num check
1 99 0
2 101 2
3 199 1
4 250 1
5 999 0
6 1500 1
7 3000 0
8 4000 2
9 5000 2
which can be obtained by the following code
mock$check <- mapply(
function(x){
count = 0
for (i in 1:nrow(range)){
if (x >= range$x[i] & x <= range$y[i]){
count = count 1
}
}
paste0(count)
},
mock$num
)
The above approach is unsuitable for large datasets due to the for loop and I was trying to use a non-equi join method (via data.table
). However, I am stuck as to how to do a non-equi join between two tables (not sure how to proceed to...)
nonequi <- mock[range, on =.(num >= x, num <=y),]
Would like to seek some guidance to this problem. Appreciate your help.
CodePudding user response:
First do non-equi join then summarise for each id
how many times it appears and join back to mock
to find out which ids were left out. For those ids set count to zero.
res <- setDT(mock)[setDT(range), .(id), on = .(num >= x, num <= y)][, .N, by = id][mock, on = .(id)][is.na(N), N := 0][]
CodePudding user response:
not sure about the execution speed for large data arrays
id <- c(1:9)
num <- c(99,101,199,250,999,1500,3000,4000,5000)
mock <- data.frame(id, num)
x <- c(100,100,200,1000,4000,4000)
y <- c(198,200,300,2000,5000,5000)
RANGE <- data.frame(x,y)
mock$check <- sapply(mock$num, function(z) sum(z >= RANGE$x & z <= RANGE$y))
mock
#> id num check
#> 1 1 99 0
#> 2 2 101 2
#> 3 3 199 1
#> 4 4 250 1
#> 5 5 999 0
#> 6 6 1500 1
#> 7 7 3000 0
#> 8 8 4000 2
#> 9 9 5000 2
Created on 2021-11-24 by the reprex package (v2.0.1)