Check if a value in a dataframe is conditionally between a range of values specified by two columns-CodePudding

So, I have two dataframes - somewhat large (df1 ~= 20k rows & df2 ~= 1.5 million) - and I want to check if a value in df1 is between df2$low & df2$high, but do it conditionally (to limit the number of checks) and only do the check if abs(df1$val-df2$val) < 2. If the value in df1 is found to be within the df2 range then add into a new column with TRUE/FALSE values.

df1

weight	low	high
94.99610	94.99608	94.99613
95.00561	95.00558	95.00566

df2

index	th_weight
1	94.996092
2	95.496336
3	95.509906
4	97.473292
5	100.519060

Desired output should be:

df1

weight	lower	upper	filter
94.99610	94.99608	94.99613	TRUE
95.00561	95.00558	95.00566	FALSE

So in that example the logic is that rows 4 and 5 in df2 would get omitted. Value in row 1 of df2 is within row 1 of df1$low & df1$high, therefore df1[,filter := TRUE] for that weight

I have tried multiple ways, with double looping into the data.table, which, as expected, was the least efficient and I also tried to make a cartesian product of df1 x df2 with dplyr::full_join but having memory issues there. Even splitting into chunks and appending into a file (so I can later easily manipulate the file via UNIX sed or awk) didn't help.

There might be a very easy way to do this and I might be completely off-track, so I apologise in advance.

CodePudding user response：

Use a non-equi join with data.table - convert the first data to data.table (setDT), create the filter column as logical (FALSE) values. Do a non-equi join, and assign (:=) the filter to TRUE, which changes the FALSE to TRUE only when the condition (abs(weight - th_weight) < 2) meets

library(data.table)
setDT(df1)[, filter := FALSE]
df1[df2, filter := abs(weight - th_weight) < 2, 
       on = .(low <= th_weight, high >= th_weight)]

-output

> df1
     weight      low     high filter
      <num>    <num>    <num> <lgcl>
1: 94.99610 94.99608 94.99613   TRUE
2: 95.00561 95.00558 95.00566  FALSE

data

df1 <- structure(list(weight = c(94.9961, 95.00561), low = c(94.99608, 
95.00558), high = c(94.99613, 95.00566)), class = "data.frame", row.names = c(NA, 
-2L))

df2 <- structure(list(index = 1:5, th_weight = c(94.996092, 95.496336, 
95.509906, 97.473292, 100.51906)), class = "data.frame", row.names = c(NA, 
-5L))