Home > Net >  Remove distinct values based on condition
Remove distinct values based on condition

Time:07-23

I have a dataset I am trying to remove duplicate values on but need to retain the rows where a condition is met. It looks like,

col1 col2
a    NA
a    1
b    1
c    1
d    1
d    2

If I just run the normal distinct functions I retain just the first value/row of the duplicates

col1 col2
a    NA
b    1
c    1
d    1

BUT - I need to retain

col1 col2
a    1
b    1
c    1
d    1

I have tried

df <- df %>% 
  group_by(col1) %>%
  top_n(1, col2)

But it seems to be removing extra rows within a larger dataset that does not represent duplicates from col1. It is somehow running it's own condition on col2 and removing extra beyond the duplicates.

In my real example col1 are serial #'s and col2 are dates. I am trying to remove NA's from col2 while also trying to preserve any that have the max date of potentially two date values (an older date and a newer date)

CodePudding user response:

We could group arrange and slice:

library(dplyr)

df %>% 
  group_by(col1) %>% 
  arrange(col2, .by_group = TRUE) %>% 
  slice(1)

This (for this example!!!) gives the same result using add_count:

library(dplyr)
df %>%
  add_count(col2) %>% 
  filter(n!=1) %>%
  select(-n)

 col1   col2
  <chr> <int>
1 a         1
2 b         1
3 c         1
4 d         1

CodePudding user response:

A possible solution:

library(dplyr)

df %>% 
  group_by(col1) %>% 
  filter(!is.na(col2) | n() == 1) %>% 
  slice_head(n = 1) %>% 
  ungroup

#> # A tibble: 4 × 2
#>   col1   col2
#>   <chr> <int>
#> 1 a         1
#> 2 b         1
#> 3 c         1
#> 4 d         1
  • Related