I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).
I figured out how to do this for pairs of 2 datasets:
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))
library(dplyr)
inner_join(df1, df2)
inner_join(df1, df3)
inner_join(df2, df3)
- Is it possible to do this for 3 datasets all at once?
The straightforward way does not seem to work:
inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected
I thought I had found a way to do this:
library(plyr)
join_all(list(df1, df2, df3), type='inner')
But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:
Joining by: id, names
Joining by: id, names
[1] id names
<0 rows> (or 0-length row.names)
This is not correct, seeing as in the example I created:
- Row 3 for df1 and df2 are identical (id = 3,name = peter)
- Row 2 for df2 and df3 are identical (id = 2, name= john)
I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?
Thank you!
CodePudding user response:
Does this count?
dfall<-bind_rows(df1,df2,df3)
dfall[duplicated(dfall),]
id names
6 3 peter
8 2 john
CodePudding user response:
A possible solution (in case you want a dataframe as result, just pipe bind_rows
at the end):
library(dplyr)
combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2])))
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#> id names
#> 1 3 peter
#>
#> [[2]]
#> [1] id names
#> <0 rows> (or 0-length row.names)
#>
#> [[3]]
#> id names
#> 1 2 john
CodePudding user response:
You can do this using get_dupes
from janitor
package.
library(tidyverse)
library(janitor)
# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1)
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)
# Bind dataframes
# Get duplicates
df1 %>%
bind_rows(df2) %>%
bind_rows(df3) %>%
get_dupes(c(id, names))
#> id names dupe_count df_id
#> 1 2 john 2 2
#> 2 2 john 2 3
#> 3 3 peter 2 1
#> 4 3 peter 2 2