So, I have two dataframes
- somewhat large (df1 ~= 20k rows & df2 ~= 1.5 million) - and I want to check if a value in df1
is between df2$low & df2$high
, but do it conditionally (to limit the number of checks) and only do the check if abs(df1$val-df2$val) < 2
. If the value in df1 is found to be within the df2 range then add into a new column with TRUE/FALSE
values.
df1
weight | low | high |
---|---|---|
94.99610 | 94.99608 | 94.99613 |
95.00561 | 95.00558 | 95.00566 |
df2
index | th_weight |
---|---|
1 | 94.996092 |
2 | 95.496336 |
3 | 95.509906 |
4 | 97.473292 |
5 | 100.519060 |
Desired output should be:
df1
weight | lower | upper | filter |
---|---|---|---|
94.99610 | 94.99608 | 94.99613 | TRUE |
95.00561 | 95.00558 | 95.00566 | FALSE |
So in that example the logic is that rows 4 and 5 in df2
would get omitted. Value in row 1 of df2
is within row 1 of df1$low & df1$high
, therefore df1[,filter := TRUE]
for that weight
I have tried multiple ways, with double looping into the data.table
, which, as expected, was the least efficient and I also tried to make a cartesian product of df1 x df2
with dplyr::full_join
but having memory issues there. Even splitting into chunks and appending into a file (so I can later easily manipulate the file via UNIX sed
or awk
) didn't help.
There might be a very easy way to do this and I might be completely off-track, so I apologise in advance.
CodePudding user response:
Use a non-equi join with data.table
- convert the first data to data.table (setDT
), create the filter
column as logical (FALSE
) values. Do a non-equi join, and assign (:=
) the filter
to TRUE
, which changes the FALSE
to TRUE
only when the condition (abs(weight - th_weight) < 2
) meets
library(data.table)
setDT(df1)[, filter := FALSE]
df1[df2, filter := abs(weight - th_weight) < 2,
on = .(low <= th_weight, high >= th_weight)]
-output
> df1
weight low high filter
<num> <num> <num> <lgcl>
1: 94.99610 94.99608 94.99613 TRUE
2: 95.00561 95.00558 95.00566 FALSE
data
df1 <- structure(list(weight = c(94.9961, 95.00561), low = c(94.99608,
95.00558), high = c(94.99613, 95.00566)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(index = 1:5, th_weight = c(94.996092, 95.496336,
95.509906, 97.473292, 100.51906)), class = "data.frame", row.names = c(NA,
-5L))