Home > Back-end >  Join two data frame by considering if values of paired columns are in range of the value of paired c
Join two data frame by considering if values of paired columns are in range of the value of paired c

Time:02-19

So let me provide some example to explain my question. I have two data df1 and df2. I want to left join two dataset. by satisfying two conditions.

(1) week are identical
(2) m1 and m2 in df1 are same as m1 and m2 in df2 but ignoring the column name

So the expect output is df3


 df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))

df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2)) 

print(df1)
      m1     m2 week
1 100010 100020    1
2 100010 100020    2
3 100010 100020    3
4 100020 100010    1
5 100020 100010    1
6 100020 100010    3
print(df2)
      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
print(df3)
      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
4 100020 100010    1    3
5 100020 100010    1    3
6 100020 100010    3    2

I tried separately merge but it created duplicated column for freq which is not desired. Is this anything else I can try? Many thanks!

CodePudding user response:

I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.

library(sqldf)
library(dplyr)

df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2)) 
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))

df_sql <- 
  sqldf::sqldf("SELECT a.*, b.freq
               FROM df1 a
               LEFT JOIN df2 b 
               ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
                  (a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")

identical(df_sql, df3)
#> [1] TRUE

I am sure there are more elegant ways to do this, but the second strategy is just to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.

df <-
  df2 %>%
  rename(m2 = m1, m1 = m2) %>%
  bind_rows(df2, .) %>%
  left_join(df1, ., by = c("week", "m1", "m2"))


identical(df, df3)
#> [1] TRUE

I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.

Created on 2022-02-17 by the reprex package (v2.0.1)

CodePudding user response:

If we want an OR join, we could use regex_left_join from fuzzyjoin

library(dplyr)
library(fuzzyjoin)
library(stringr)
regex_left_join(df1 %>% 
   mutate(m1m2 = str_c(m1, m2, sep = "|")), 
   df2 %>%
    mutate(m1m2 = str_c(m1, m2, sep = "|"), .keep = "unused"), 
     by = c("m1m2", "week")) %>% 
   select(m1, m2, week = week.x, freq)

-output

      m1     m2 week freq
1 100010 100020    1    3
2 100010 100020    2    1
3 100010 100020    3    2
4 100020 100010    1    3
5 100020 100010    1    3
6 100020 100010    3    2
  • Related