I have two dataframes organised like this.
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
firstname = c("John", "Jane", "Hans")
)
df2 <- data.frame(lastname =c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
df2
is not necessarily a subset of df1
. Duplicated entries are also possible.
My goal is to keep a copy of df1
in which all entries occur represented in both dfs. Alternatively, I would like to end up with a subset of df1
with a new variable, indicating that the name is also element of df2
.
Can someone suggest a way to do this? A {dyplr}
-attempt is totally fine.
Desired output for the the paticular simple case:
res <- data.frame(lastname = c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
CodePudding user response:
Including the "alternatively" part of the question this is an approach with left_join
. Adding a grouping variable grp to distinguish the 2 sets.
library(dplyr)
left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))
lastname firstname grp_A grp_B
1 Miller John A <NA>
2 Smith Jane A B
3 Grey Hans A B
or with base R merge
merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffixes=c("_A", "_B"), all=T)
lastname firstname grp_A grp_B
1 Grey Hans A B
2 Miller John A <NA>
3 Smith Jane A B
To remove NA
and compact the grps
na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))) %>%
summarize(lastname, firstname,
grp = list(across(starts_with("grp"), ~ unique(.x))))
lastname firstname grp
1 Smith Jane A, B
2 Grey Hans A, B
The other part is simply
merge(df1, df2)
lastname firstname
1 Grey Hans
2 Smith Jane