So let me provide some example to explain my question. I have two data df1 and df2. I want to left join two dataset. by satisfying two conditions.
(1) week are identical
(2) m1 and m2 in df1 are same as m1 and m2 in df2 but ignoring the column name
So the expect output is df3
df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
print(df1)
m1 m2 week
1 100010 100020 1
2 100010 100020 2
3 100010 100020 3
4 100020 100010 1
5 100020 100010 1
6 100020 100010 3
print(df2)
m1 m2 week freq
1 100010 100020 1 3
2 100010 100020 2 1
3 100010 100020 3 2
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
print(df3)
m1 m2 week freq
1 100010 100020 1 3
2 100010 100020 2 1
3 100010 100020 3 2
4 100020 100010 1 3
5 100020 100010 1 3
6 100020 100010 3 2
I tried separately merge but it created duplicated column for freq which is not desired. Is this anything else I can try? Many thanks!
CodePudding user response:
I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.
library(sqldf)
library(dplyr)
df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
df_sql <-
sqldf::sqldf("SELECT a.*, b.freq
FROM df1 a
LEFT JOIN df2 b
ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
(a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")
identical(df_sql, df3)
#> [1] TRUE
I am sure there are more elegant ways to do this, but the second strategy is just to duplicate df2
, rename the columns with m1
and m2
reversed, and then do the join.
df <-
df2 %>%
rename(m2 = m1, m1 = m2) %>%
bind_rows(df2, .) %>%
left_join(df1, ., by = c("week", "m1", "m2"))
identical(df, df3)
#> [1] TRUE
I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.
Created on 2022-02-17 by the reprex package (v2.0.1)
CodePudding user response:
If we want an OR
join, we could use regex_left_join
from fuzzyjoin
library(dplyr)
library(fuzzyjoin)
library(stringr)
regex_left_join(df1 %>%
mutate(m1m2 = str_c(m1, m2, sep = "|")),
df2 %>%
mutate(m1m2 = str_c(m1, m2, sep = "|"), .keep = "unused"),
by = c("m1m2", "week")) %>%
select(m1, m2, week = week.x, freq)
-output
m1 m2 week freq
1 100010 100020 1 3
2 100010 100020 2 1
3 100010 100020 3 2
4 100020 100010 1 3
5 100020 100010 1 3
6 100020 100010 3 2