Home > Software engineering >  Finding Identical Rows in Multiple Datasets
Finding Identical Rows in Multiple Datasets

Time:04-08

I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).

I figured out how to do this for pairs of 2 datasets:

df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))

df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))

df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))


library(dplyr)

inner_join(df1, df2)

inner_join(df1, df3)

inner_join(df2, df3)
  • Is it possible to do this for 3 datasets all at once?

The straightforward way does not seem to work:

inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected

I thought I had found a way to do this:

library(plyr)
join_all(list(df1, df2, df3), type='inner')

But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:

Joining by: id, names
Joining by: id, names
[1] id    names
<0 rows> (or 0-length row.names)

This is not correct, seeing as in the example I created:

  • Row 3 for df1 and df2 are identical (id = 3,name = peter)
  • Row 2 for df2 and df3 are identical (id = 2, name= john)

I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?

Thank you!

CodePudding user response:

Does this count?

dfall<-bind_rows(df1,df2,df3)
dfall[duplicated(dfall),]
  id names
6  3 peter
8  2  john

CodePudding user response:

A possible solution (in case you want a dataframe as result, just pipe bind_rows at the end):

library(dplyr)

combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2]))) 

#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#>   id names
#> 1  3 peter
#> 
#> [[2]]
#> [1] id    names
#> <0 rows> (or 0-length row.names)
#> 
#> [[3]]
#>   id names
#> 1  2  john

CodePudding user response:

You can do this using get_dupes from janitor package.

library(tidyverse)
library(janitor)

# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1) 
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)

# Bind dataframes
# Get duplicates
df1 %>% 
  bind_rows(df2) %>% 
  bind_rows(df3) %>% 
  get_dupes(c(id, names))

#>   id names dupe_count df_id
#> 1  2  john          2     2
#> 2  2  john          2     3
#> 3  3 peter          2     1
#> 4  3 peter          2     2
  • Related