Home > Software design >  Remove inconsistent duplicate entries from data frame with Base R
Remove inconsistent duplicate entries from data frame with Base R

Time:12-13

I want to remove duplicate entries from a data frame that are inconsistent, the following gives a simplified example:

df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
##       name amount
## 1     Andy    100
## 2     Bert     50
## 3    Cindy     30
## 4    Cindy     30
## 5    David    200
## 6    Edgar     65
## 7    Edgar     55
## 8    Frank     90
## 9   George    120
## 10  George    120
## 11  George    120
## 12 Herbert    300
## 13    Iris     15
## 14    Iris     25
## 15    Iris     25

Version A)
Edgar and Iris are the same person yet the given amounts are inconsistent so I want to remove the entries:

#remove inconsistent duplicate entries
df2
##       name amount
## 1     Andy    100
## 2     Bert     50
## 3    Cindy     30
## 4    Cindy     30
## 5    David    200
## 6    Frank     90
## 7   George    120
## 8   George    120
## 9   George    120
## 10 Herbert    300

Version B)
Another possibility would be to keep only one instance of the consistent entries:

#keep only one instance of consistent entries
df3
##      name amount
## 1    Andy    100
## 2    Bert     50
## 3   Cindy     30
## 4   David    200
## 5   Frank     90
## 6  George    120
## 7 Herbert    300

I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.

CodePudding user response:

A base solution that solves both at once. This has the side effect of requiring row name changes.

A Remove "inconsistent" values

  new_df<-do.call("rbind",
            Filter(function(x) all(x$amount == x$amount[1]),
                    split(df,df$name)))

            name amount
Andy         Andy    100
Bert         Bert     50
Cindy.3     Cindy     30
Cindy.4     Cindy     30
David       David    200
Frank       Frank     90
George.9   George    120
George.10  George    120
George.11  George    120
Herbert   Herbert    300

The above needs further cleaning of row names (an unwanted side effect perhaps but we deal with that below)

B Remove duplicates

 new_df<-new_df[!duplicated(new_df$name),]
 row.names(new_df) <- 1:nrow(new_df)

Combined result

new_df
         name amount
    1    Andy    100
    2    Bert     50
    3   Cindy     30
    4   David    200
    5   Frank     90
    6  George    120
    7 Herbert    300

The question specifically requests for a base solution. If for whatever reason someone from the future wants to use dplyr, I will leave this solution here.

Using dplyr, we can check if all values are equal to the first value of amount. If not, make them NA and delete them. Proceed with removing duplicates for what remains.

A Remove inconsistent ones

library(dplyr)
(df %>% 
  group_by(name) %>% 
   mutate(name = ifelse(!all(amount==first(amount)), NA, name)) %>% 
   na.omit()  -> new_df)

 A tibble: 10 x 2
# Groups:   name [7]
   name    amount
   <chr>    <dbl>
 1 Andy       100
 2 Bert        50
 3 Cindy       30
 4 Cindy       30
 5 David      200
 6 Frank       90
 7 George     120
 8 George     120
 9 George     120
10 Herbert    300

Remove duplicates

new_df %>%
   filter(!duplicated(name)) %>% 
   ungroup()
# A tibble: 7 x 2
  name    amount
  <chr>    <dbl>
1 Andy       100
2 Bert        50
3 Cindy       30
4 David      200
5 Frank       90
6 George     120
7 Herbert    300

CodePudding user response:

You can use duplicate, but you need to remove all duplicate rows. (your option B).

The result can be used to filter the data frame for all rows.

df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))

df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]

df3
#>       name amount
#> 1     Andy    100
#> 2     Bert     50
#> 3    Cindy     30
#> 5    David    200
#> 8    Frank     90
#> 9   George    120
#> 12 Herbert    300

df[df$name %in% df3$name, ]
#>       name amount
#> 1     Andy    100
#> 2     Bert     50
#> 3    Cindy     30
#> 4    Cindy     30
#> 5    David    200
#> 8    Frank     90
#> 9   George    120
#> 10  George    120
#> 11  George    120
#> 12 Herbert    300

Created on 2021-12-12 by the reprex package (v2.0.1)

CodePudding user response:

For the first requirement, where you need to get rid of duplicate entries, there's an in-built function in R called duplicated.

Here's the code:

df[!duplicated(df), ]
df[!duplicated(df$name),]

The output looks like this:

      name amount
1     Andy    100
2     Bert     50
3    Cindy     30
5    David    200
6    Edgar     65
8    Frank     90
9   George    120
12 Herbert    300
13    Iris     15

And for the second requirement, you'll need to do something like this:

df <- unique(df)
df <- split(df, df$name)

df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)

The output looks like this:

     name amount
1    Andy    100
2    Bert     50
3   Cindy     30
4   David    200
5   Frank     90
6  George    120
7 Herbert    300

Both versions are using Base-R. You can do the same using dplyr package in R.

CodePudding user response:

A) First aggregate to apply the conditions, then filter the data and finally stack the result.

t <- aggregate( amount ~ name, df, function(x) c(unique(x),length(x)) )

t_m <- t[!sapply( t$amount, function(x) (length(x)>2) ),]

setNames( stack( setNames(lapply( t_m$amount, function(x) 
  rep(x[1],x[2]) ), t_m$name) )[,c("ind", "values")], colnames(df) ) 
      name amount
1     Andy    100
2     Bert     50
3    Cindy     30
4    Cindy     30
5    David    200
6    Frank     90
7   George    120
8   George    120
9   George    120
10 Herbert    300

B) Is a bit more straightforward. Just aggregate and filter.

t <- aggregate( amount ~ name, df, unique )

t[lengths(t$amount) == 1,]
     name amount
1    Andy    100
2    Bert     50
3   Cindy     30
4   David    200
6   Frank     90
7  George    120
8 Herbert    300
  • Related