Home > database >  How to filter/sort duplicates based on a text/factor column?
How to filter/sort duplicates based on a text/factor column?

Time:10-22

I have a data frame that has a column with unique names and then another column that has factors/char (can do either). I want to be able to filter out duplicate names and IF one of the duplicates has a certain value for a factor I want to keep that one, if there are two duplicates with the same factor, then I don't care which one it keeps.

   Name        Status
1. John          A
2. John          B
3. Sally         A
4. Alex          A
5. Sarah         B
6. Joe           A
7. Joe           A
8. Sue           B
9. Sue           B

I want to keep the duplicate if the factor/char is set to B. If there are two As or two Bs I don't care which one it keeps.

This is the result I want:

1. John          B
2. Sally         A
3. Alex          A
4. Sarah         B
5. Joe           A
6. Sue           B

I have tried the following but it's still keeping both A and B for John:

Name <- c("John","John","Sally","Alex", "Sarah", "Joe", "Joe", "Sue", "Sue")
Status <- c('A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B')
df_reddit <- data.frame(Name,Status)

df_reddit[, 'Status'] <- as.factor(df_reddit[, 'Status'])

df_reddit$Status <- factor(df_reddit$Status, levels = c("A", "B"))

df_reddit <- df_reddit[order(df_reddit$Status),]

df_reddit[!duplicated(df_reddit[,c('Name')]),]

Any help is appreciated! Would a loop or something be better for this?

CodePudding user response:

A pure dplyr approach may look like so:

library(dplyr)

df_reddit |> 
  add_count(Name, wt = Status == "B") |> 
  filter(Status == "B" | n == 0) |> 
  distinct(Name, Status)
#>    Name Status
#> 1  John      B
#> 2 Sally      A
#> 3  Alex      A
#> 4 Sarah      B
#> 5   Joe      A
#> 6   Sue      B

CodePudding user response:

Here's one option using dplyr and data.table:

library(dplyr)
library(data.table)

df_reddit %>% 
  group_by(Name) %>% 
  mutate(rle = data.table::rleid(Status)) %>% 
  filter(rle == max(rle)) %>% 
  select(-rle) %>% 
  unique() %>%
  ungroup()
# A tibble: 6 × 2
  Name  Status
  <chr> <chr> 
1 John  B     
2 Sally A     
3 Alex  A     
4 Sarah B     
5 Joe   A     
6 Sue   B   
  • Related