Home > Mobile >  Compare two dataframes based on first and last name separeted by row
Compare two dataframes based on first and last name separeted by row

Time:12-01

I have two dataframes organised like this.

df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
                  firstname = c("John", "Jane", "Hans")
                  )

df2 <- data.frame(lastname =c("Smith", "Grey"),
                  firstname = c("Jane", "Hans")
                  )

df2 is not necessarily a subset of df1. Duplicated entries are also possible.

My goal is to keep a copy of df1 in which all entries occur represented in both dfs. Alternatively, I would like to end up with a subset of df1 with a new variable, indicating that the name is also element of df2. Can someone suggest a way to do this? A {dyplr}-attempt is totally fine.

Desired output for the the paticular simple case:

res <- data.frame(lastname = c("Smith", "Grey"), 
                  firstname = c("Jane", "Hans")
                 )

CodePudding user response:

Including the "alternatively" part of the question this is an approach with left_join. Adding a grouping variable grp to distinguish the 2 sets.

library(dplyr)

left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"), 
  c("lastname", "firstname"), suffix=c("_A", "_B"))
  lastname firstname grp_A grp_B
1   Miller      John     A  <NA>
2    Smith      Jane     A     B
3     Grey      Hans     A     B

or with base R merge

merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"), 
  c("lastname", "firstname"), suffixes=c("_A", "_B"), all=T)
  lastname firstname grp_A grp_B
1     Grey      Hans     A     B
2   Miller      John     A  <NA>
3    Smith      Jane     A     B

To remove NA and compact the grps

na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"), 
    c("lastname", "firstname"), suffix=c("_A", "_B"))) %>% 
  summarize(lastname, firstname, 
    grp = list(across(starts_with("grp"), ~ unique(.x))))
  lastname firstname  grp
1    Smith      Jane A, B
2     Grey      Hans A, B

The other part is simply

merge(df1, df2)
  lastname firstname
1     Grey      Hans
2    Smith      Jane
  • Related