Home > Mobile >  Joining dataframes of different dimensions with varying merge by criterion
Joining dataframes of different dimensions with varying merge by criterion

Time:10-05

Good evening, I am trying to merge a couple datasets and my normal tools in R are failing me tonight. Consider df1 and df2 below.

df1 = data.frame(a = c("a", "b", "c"),
                 b = c("1", "2", "3"),
                 c = c("x", "y", "z"))
df2 = data.frame(a = c("1", "b", "c", "d", "e"),
                 b = c("a", "2", "3", "4", "5"),
                 d = c("x2", "y2", "z2", "x3", "y3"))

In both cases, column a and b are supposed to act as grouping variables. For example, in df1, when a = a and b = 1, then c = x. Given the structure of the data I am working with, the actual order of a and b does not matter such that if a were to = 1 and b = a, then c will still equal x. Herein lies the problem, I would like to merge df1 with a new df, df2. df2, is similarly structured, but contains a new variable d. And, as can be seen df2 includes some a and b combinations that are backwards compared to A. In addition, B has some additional observations.

The desired dataframe I am looking for looks like this:

desired = data.frame(a = c("a", "b", "c"),
                      b = c("1", "2", "3"),
                      c = c("x", "y", "z"),
                     d = c("x2", "y2", "z2"))

As can be seen the original column structure from a b and c are preserved, and we have added in column D. However, we have not added any new observations.

I have tried using merge() with varying combinations of by.x, by.y.

I also tried using various left_join and inner_join but I keep on getting whaping data sets that still aren't handling the mismatch in the a/b columns.

Thanks for any thoughts or help you might be able to provide.

Cheers

CodePudding user response:

You can left_join df2 with df1 twice and use coalesce -

library(dplyr)

df1 %>% 
  left_join(df2, by = c("a"="a", "b"="b")) %>% 
  left_join(df2, by = c("a"="b", "b"="a")) %>% 
  mutate(
    d = coalesce(d.x, d.y)
  ) %>% 
  select(a,b,c,d)

  a b c  d
1 a 1 x x2
2 b 2 y y2
3 c 3 z z2

CodePudding user response:

Good morning. It appears, the actual order of a and b does matter. sort your df2, or maybe both.

df2[1:2] <- t(apply(df2[1:2], 1, sort, decreasing=TRUE))
merge(df1, df2)
#   a b c  d
# 1 a 1 x x2
# 2 b 2 y y2
# 3 c 3 z z2
  • Related