Remove duplicates from merged dataframe while keeping duplicates within original datasets


I have multiple datasets which I combined using rbind. Each dataset keeps its original IDs, and I attached a unique number per dataset (e.g. all rows from dataset 1 have dfid = 1, and within dataset 1 the original ID counts up row by row; in dataset 2 the ID starts from 1 again, but with dfid = 2).

The datasets contain duplicate rows. Duplicates within a single dataset (same dfid, different ID) should be kept, but when the same row also appears in another dataset (different dfid) it should be removed from that other dataset: the row with the lowest dfid is kept and the duplicate from the dataset with the higher dfid is removed.

library(tidyverse)
df1 = head(iris, 3)
df2 = tail(iris, 4)
df2 = rbind(df2, df1[rep(1), ])
df1 = rbind(df1, df1[rep(1), ])
df1 = df1 %>% mutate(ID = 1:nrow(df1)) %>% mutate(dfid = 1)
df2 = df2 %>% mutate(ID = 1:nrow(df2)) %>% mutate(dfid = 2)

dfexamp = rbind(df1, df2) %>% as.data.frame(row.names = 1:nrow(.))

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species ID dfid
1          5.1         3.5          1.4         0.2    setosa  1    1
2          4.9         3.0          1.4         0.2    setosa  2    1
3          4.7         3.2          1.3         0.2    setosa  3    1
4          5.1         3.5          1.4         0.2    setosa  4    1
5          6.3         2.5          5.0         1.9 virginica  1    2
6          6.5         3.0          5.2         2.0 virginica  2    2
7          6.2         3.4          5.4         2.3 virginica  3    2
8          5.9         3.0          5.1         1.8 virginica  4    2
9          5.1         3.5          1.4         0.2    setosa  5    2

And my desired outcome would be:

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species ID dfid
1          5.1         3.5          1.4         0.2    setosa  1    1
2          4.9         3.0          1.4         0.2    setosa  2    1
3          4.7         3.2          1.3         0.2    setosa  3    1
4          5.1         3.5          1.4         0.2    setosa  4    1
5          6.3         2.5          5.0         1.9 virginica  1    2
6          6.5         3.0          5.2         2.0 virginica  2    2
7          6.2         3.4          5.4         2.3 virginica  3    2
8          5.9         3.0          5.1         1.8 virginica  4    2

I tried using group_by, but this either removes all the duplicates or, when excluding ID with vars(-ID), removes exactly the rows I want to keep.

dfexamp %>% group_by_at(vars(-dfid, -ID)) %>% 
  filter(n() < 2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species      ID  dfid
1          4.9         3            1.4         0.2 setosa        2     1
2          4.7         3.2          1.3         0.2 setosa        3     1
3          6.3         2.5          5           1.9 virginica     1     2
4          6.5         3            5.2         2   virginica     2     2
5          6.2         3.4          5.4         2.3 virginica     3     2
6          5.9         3            5.1         1.8 virginica     4     2

CodePudding user response:

Your attempt was extremely close: include ID in the grouping (i.e. drop -ID from the exclusion ;) and then use summarise to select the first appearance of dfid:

dfexamp %>% 
  arrange(dfid, ID) %>% 
  group_by_at(vars(-dfid)) %>% 
  summarise_all(first)

Returns:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species      ID  dfid
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     <int> <dbl>
1          4.7         3.2          1.3         0.2 setosa        3     1
2          4.9         3            1.4         0.2 setosa        2     1
3          5.1         3.5          1.4         0.2 setosa        1     1
4          5.1         3.5          1.4         0.2 setosa        4     1
5          5.9         3            5.1         1.8 virginica     3     2
6          6.2         3.4          5.4         2.3 virginica     2     2
7          6.5         3            5.2         2   virginica     1     2
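
If you prefer the newer dplyr verbs (group_by_at() and summarise_all() are superseded), the same grouping idea can be sketched like this; it is just an equivalent rewrite, assuming dplyr >= 1.0:

dfexamp %>%
  # group on every column except dfid (ID included), as above
  group_by(across(-dfid)) %>%
  # keep the row with the lowest dfid in each group
  slice_min(dfid, n = 1, with_ties = FALSE) %>%
  ungroup()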

CodePudding user response:

This should work:

library(tidyverse)
dfexamp %>% 
  # build a row key from every column except dfid, so rows that are
  # identical apart from dfid share the same key
  mutate(key = as.integer(as.factor(
    apply(.[, names(.) != "dfid"], 1, str_c, collapse = ",")
  ))) %>% 
  # split into one data frame per key, keep only the rows with the lowest
  # dfid in each piece, then stitch the pieces back together
  group_split(key) %>% 
  map(~ .x %>% filter(dfid == min(dfid))) %>% 
  bind_rows() %>% 
  select(-key)
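
The group_split()/map()/bind_rows() round trip can also be collapsed into a grouped filter; this is only a sketch of that variant, with the grouping covering the same columns as the key (everything except dfid):

dfexamp %>%
  # group on every column except dfid, mirroring the key above
  group_by(across(-dfid)) %>%
  # within each group keep only the rows at the minimum dfid
  filter(dfid == min(dfid)) %>%
  ungroup()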