I have multiple datasets which I combined using rbind. The datasets contained original ID's and I attached a unique number for every dataset (e.g. all data from dataset 1 all have dfid =1 and every row from dataset 1 has an original ID which accumulates until the data from dataset 2. Then the id value starts from 1 again, but with dfid=2).
The datasets contain duplicate rows, which should be kept if they have a different ID, but there are also duplicates in other datasets. I want to keep the duplicates within a dataset (with the same dfid), but if they appear in another dataset (with a different dfid) the duplicates in that dataset should be removed. The row with the lowest dfid should be kept and the duplicate row from another dataset with a higher dfid should be removed.
library(tidyverse)
df1= head(iris,3)
df2= tail(iris,4)
df2 =rbind(df2, df1[rep(1), ])
df1 =rbind(df1, df1[rep(1), ])
df1 =df1%>%mutate(ID=1:nrow(df1)) %>%mutate(dfid=1)
df2 =df2%>%mutate(ID=1:nrow(df2))%>%mutate(dfid=2)
dfexamp =rbind(df1, df2)%>% as.data.frame(row.names = 1:nrow(.))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID dfid
1 5.1 3.5 1.4 0.2 setosa 1 1
2 4.9 3.0 1.4 0.2 setosa 2 1
3 4.7 3.2 1.3 0.2 setosa 3 1
4 5.1 3.5 1.4 0.2 setosa 4 1
5 6.3 2.5 5.0 1.9 virginica 1 2
6 6.5 3.0 5.2 2.0 virginica 2 2
7 6.2 3.4 5.4 2.3 virginica 3 2
8 5.9 3.0 5.1 1.8 virginica 4 2
9 5.1 3.5 1.4 0.2 setosa 5 2
And my desired outcome would be:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID dfid
1 5.1 3.5 1.4 0.2 setosa 1 1
2 4.9 3.0 1.4 0.2 setosa 2 1
3 4.7 3.2 1.3 0.2 setosa 3 1
4 5.1 3.5 1.4 0.2 setosa 4 1
5 6.3 2.5 5.0 1.9 virginica 1 2
6 6.5 3.0 5.2 2.0 virginica 2 2
7 6.2 3.4 5.4 2.3 virginica 3 2
8 5.9 3.0 5.1 1.8 virginica 4 2
I tried using group_by but this removes all the duplicates, or just the rows I want to keep (using vars(-ID)).
dfexamp %>% group_by_at(vars(-dfid, -ID)) %>%
filter(n() < 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID dfid
1 4.9 3 1.4 0.2 setosa 2 1
2 4.7 3.2 1.3 0.2 setosa 3 1
3 6.3 2.5 5 1.9 virginica 1 2
4 6.5 3 5.2 2 virginica 2 2
5 6.2 3.4 5.4 2.3 virginica 3 2
6 5.9 3 5.1 1.8 virginica 4 2
CodePudding user response:
Your attempt was extremely close; just include ID in the grouping (i.e. remove ID from the exclusion from the grouping ;) and then use summarise to select the first appearance of dfid:
dfexamp %>%
arrange(dfid, ID) %>%
group_by_at(vars(-dfid)) %>%
summarise_all(first)
Returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID dfid
<dbl> <dbl> <dbl> <dbl> <fct> <int> <dbl>
1 4.7 3.2 1.3 0.2 setosa 3 1
2 4.9 3 1.4 0.2 setosa 2 1
3 5.1 3.5 1.4 0.2 setosa 1 1
4 5.1 3.5 1.4 0.2 setosa 4 1
5 5.9 3 5.1 1.8 virginica 3 2
6 6.2 3.4 5.4 2.3 virginica 2 2
7 6.5 3 5.2 2 virginica 1 2
CodePudding user response:
This should work:
library(tidyverse)
dfexamp %>%
mutate(key = as.integer(
as.factor(
apply(.[,names(.) != "dfid"], 1, str_c, collapse = ",")
)
)
) %>%
group_split(key) %>%
map(~ .x %>% filter(dfid == min(dfid))) %>%
bind_rows(.) %>%
select(-key)