Home > Net >  R remove duplicates conditional on values of multiple columns
R remove duplicates conditional on values of multiple columns

Time:03-05

My question is related to these posts: R, Remove duplicate rows conditional on value of variable

Conditionally removing duplicates

But they do not fully answer it.

I have a dataset with duplicates in key_one.

A tibble: 6 x 7
key_one   key_two  year  month  stadium        time             age
    <int>   <int> <dbl>  <dbl> <chr>            <dbl>            <int>
1 0033333 3166626  2020      6 3                 37               65
2 0033333 9071236  2020      7 3                 16               65
3 4447777 6593655  2019      2 2                 20               90
4 4447777 6593655  2019      2 2                 20               90
5 5556666 6803714  2017      2 1                 13               69
6 5556666 6803714  2017      2 2                 13               69

I want to remove the duplicates as follows:

If "time" is not equal to each other, take the key with the smaller time (for 003333 that would be 16).

If "time" is equal to each other, take the key with the higher stadium (for 555666 that would be 2).

If "time" and "stadium" are equal to each other, take the first line of the duplicated keys (for 4447777 that would be line 3 of the dataset).

Any help would be much appreciated!

CodePudding user response:

We may arrange the data by 'key_one', 'time' and 'stadium' in descending order and get the distinct rows by 'key_one'

library(dplyr)
df1 %>%
   arrange(key_one, time, desc(stadium)) %>%
   distinct(key_one, .keep_all = TRUE)

-output

 key_one key_two year month stadium time age
2 0033333 9071236 2020     7       3   16  65
3 4447777 6593655 2019     2       2   20  90
6 5556666 6803714 2017     2       2   13  69

data

df1 <- structure(list(key_one = c("0033333", "0033333", "4447777", "4447777", 
"5556666", "5556666"), key_two = c(3166626L, 9071236L, 6593655L, 
6593655L, 6803714L, 6803714L), year = c(2020L, 2020L, 2019L, 
2019L, 2017L, 2017L), month = c(6L, 7L, 2L, 2L, 2L, 2L), stadium = c(3L, 
3L, 2L, 2L, 1L, 2L), time = c(37L, 16L, 20L, 20L, 13L, 13L), 
    age = c(65L, 65L, 90L, 90L, 69L, 69L)), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")

CodePudding user response:

library(dplyr)

df %>% 
  group_by(key_one) %>% 
  slice_min(time) %>% 
  slice_max(stadium) %>% 
  distinct()
  • Related