I am working with two datasets that I would like to join based not exact matches between them, but rather approximate matches. My question is similar to this OP.
Here are examples of what my two dataframes look like.
df1
is this one:
x
4.8
12
4
3.5
12.5
18
df2
is this one:
x y
4.8 6.6
12 1
4.5 1
3.5 0.5
13 1.8
15 2
I am currently using inner_join(df1, df2, by=c("x")
to join the two together.
This gives me:
x y
4.8 6.6
12 1
3.5 0.5
However, what I really want to do is join the two df
s based on these conditions:
- any exact matches are joined first (exactly like how
inner_join()
currently works) - BUT, if there are no exact matches, then join to any match ± 0.5
The kind of output I am trying to get would look like this:
x y
4.8 6.6
12 1
4 1 #the y value is from x=4.5 in df1
4 0.5 #the y value is from x=3.5 in df1
3.5 0.5
12.5 1 #the y value is from x=12 in df1
12.5 1.8 #the y value is from x=13 in df1
I typically work in dplyr
, so a dplyr
solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr
will be flexible enough to do a "fuzzy" join.
(I am aware of the fuzzyjoin
package, but it doesn't seem to get at exactly what I am trying to do here)
CodePudding user response:
A possible solution, with no join
:
library(tidyverse)
df1 %>%
rename(x1 = x) %>%
crossing(df2) %>%
mutate(diff = abs(x1-x)) %>%
filter(diff <= 0.5) %>%
group_by(x1) %>%
mutate(aux = any(diff == 0)) %>%
filter(aux*(diff == 0) | !aux) %>%
select(-diff, -aux) %>%
ungroup
#> # A tibble: 7 × 3
#> x1 x y
#> <dbl> <dbl> <dbl>
#> 1 3.5 3.5 0.5
#> 2 4 3.5 0.5
#> 3 4 4.5 1
#> 4 4.8 4.8 6.6
#> 5 12 12 1
#> 6 12.5 12 1
#> 7 12.5 13 1.8