I want to remove duplicate entries from a data frame that are inconsistent, the following gives a simplified example:
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Edgar 65
## 7 Edgar 55
## 8 Frank 90
## 9 George 120
## 10 George 120
## 11 George 120
## 12 Herbert 300
## 13 Iris 15
## 14 Iris 25
## 15 Iris 25
Version A)
Edgar and Iris are the same person yet the given amounts are inconsistent so I want to remove the entries:
#remove inconsistent duplicate entries
df2
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Frank 90
## 7 George 120
## 8 George 120
## 9 George 120
## 10 Herbert 300
Version B)
Another possibility would be to keep only one instance of the consistent entries:
#keep only one instance of consistent entries
df3
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 David 200
## 5 Frank 90
## 6 George 120
## 7 Herbert 300
I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.
CodePudding user response:
A base
solution that solves both at once. This has the side effect of requiring row name changes.
A Remove "inconsistent" values
new_df<-do.call("rbind",
Filter(function(x) all(x$amount == x$amount[1]),
split(df,df$name)))
name amount
Andy Andy 100
Bert Bert 50
Cindy.3 Cindy 30
Cindy.4 Cindy 30
David David 200
Frank Frank 90
George.9 George 120
George.10 George 120
George.11 George 120
Herbert Herbert 300
The above needs further cleaning of row names (an unwanted side effect perhaps but we deal with that below)
B Remove duplicates
new_df<-new_df[!duplicated(new_df$name),]
row.names(new_df) <- 1:nrow(new_df)
Combined result
new_df
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
The question specifically requests for a base
solution. If for whatever reason someone from the future wants to use dplyr
, I will leave this solution here.
Using dplyr
, we can check if all values are equal to the first value of amount. If not, make them NA
and delete them. Proceed with removing duplicates for what remains.
A Remove inconsistent ones
library(dplyr)
(df %>%
group_by(name) %>%
mutate(name = ifelse(!all(amount==first(amount)), NA, name)) %>%
na.omit() -> new_df)
A tibble: 10 x 2
# Groups: name [7]
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
Remove duplicates
new_df %>%
filter(!duplicated(name)) %>%
ungroup()
# A tibble: 7 x 2
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
CodePudding user response:
You can use duplicate
, but you need to remove all duplicate rows. (your option B).
The result can be used to filter the data frame for all rows.
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
df3
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 12 Herbert 300
df[df$name %in% df3$name, ]
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 4 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 10 George 120
#> 11 George 120
#> 12 Herbert 300
Created on 2021-12-12 by the reprex package (v2.0.1)
CodePudding user response:
For the first requirement, where you need to get rid of duplicate entries, there's an in-built function in R called duplicated
.
Here's the code:
df[!duplicated(df), ]
df[!duplicated(df$name),]
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
5 David 200
6 Edgar 65
8 Frank 90
9 George 120
12 Herbert 300
13 Iris 15
And for the second requirement, you'll need to do something like this:
df <- unique(df)
df <- split(df, df$name)
df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
Both versions are using Base-R. You can do the same using dplyr
package in R.
CodePudding user response:
A) First aggregate
to apply the conditions, then filter the data and finally stack
the result.
t <- aggregate( amount ~ name, df, function(x) c(unique(x),length(x)) )
t_m <- t[!sapply( t$amount, function(x) (length(x)>2) ),]
setNames( stack( setNames(lapply( t_m$amount, function(x)
rep(x[1],x[2]) ), t_m$name) )[,c("ind", "values")], colnames(df) )
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B) Is a bit more straightforward. Just aggregate
and filter.
t <- aggregate( amount ~ name, df, unique )
t[lengths(t$amount) == 1,]
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
6 Frank 90
7 George 120
8 Herbert 300