My question is related to these posts: R, Remove duplicate rows conditional on value of variable
Conditionally removing duplicates
But they do not fully answer it.
I have a dataset with duplicates in key_one.
A tibble: 6 x 7
key_one key_two year month stadium time age
<int> <int> <dbl> <dbl> <chr> <dbl> <int>
1 0033333 3166626 2020 6 3 37 65
2 0033333 9071236 2020 7 3 16 65
3 4447777 6593655 2019 2 2 20 90
4 4447777 6593655 2019 2 2 20 90
5 5556666 6803714 2017 2 1 13 69
6 5556666 6803714 2017 2 2 13 69
I want to remove the duplicates as follows:
If "time" is not equal to each other, take the key with the smaller time (for 003333 that would be 16).
If "time" is equal to each other, take the key with the higher stadium (for 555666 that would be 2).
If "time" and "stadium" are equal to each other, take the first line of the duplicated keys (for 4447777 that would be line 3 of the dataset).
Any help would be much appreciated!
CodePudding user response:
We may arrange
the data by 'key_one', 'time' and 'stadium' in desc
ending order and get the distinct
rows by 'key_one'
library(dplyr)
df1 %>%
arrange(key_one, time, desc(stadium)) %>%
distinct(key_one, .keep_all = TRUE)
-output
key_one key_two year month stadium time age
2 0033333 9071236 2020 7 3 16 65
3 4447777 6593655 2019 2 2 20 90
6 5556666 6803714 2017 2 2 13 69
data
df1 <- structure(list(key_one = c("0033333", "0033333", "4447777", "4447777",
"5556666", "5556666"), key_two = c(3166626L, 9071236L, 6593655L,
6593655L, 6803714L, 6803714L), year = c(2020L, 2020L, 2019L,
2019L, 2017L, 2017L), month = c(6L, 7L, 2L, 2L, 2L, 2L), stadium = c(3L,
3L, 2L, 2L, 1L, 2L), time = c(37L, 16L, 20L, 20L, 13L, 13L),
age = c(65L, 65L, 90L, 90L, 69L, 69L)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
CodePudding user response:
library(dplyr)
df %>%
group_by(key_one) %>%
slice_min(time) %>%
slice_max(stadium) %>%
distinct()