I have a dataset with student scores. There are a lot of repeat rows per student ID like this below.
ID Date Score Source
1 2016-02-24 19.2 A
2 2020-01-08 16.6 B
3 2021-01-25 18.1 A
3 2021-01-25 16.2 C
4 2011-02-28 13.2 A
4 2011-02-28 17.4 A
5 2011-02-28 19.2 A
5 2011-02-28 14.6 C
6 2016-04-16 11.2 C
6 2016-04-16 12.4 C
My goal is to exclude some repeat observations and retain only observations based on this criteria.
Rule 1 : Same ID, Same date, different source. Retain only rows where Source = A Example ID 3, 5 , retain row
3 2021-01-25 18.1 A
5 2011-02-28 19.2 A
Rule 2 : Same ID, Same date, same source. Retain rows with max of score.
4 2011-02-28 17.4 A
6 2016-04-16 12.4 C
The final expected dataset
ID Date Score Source
1 2016-02-24 19.2 A
2 2020-01-08 16.6 B
3 2021-01-25 18.1 A
4 2011-02-28 17.4 A
5 2011-02-28 19.2 A
6 2016-04-16 12.4 C
I am aware of groupby and filter but I am not sure how to apply those functions in this situation. Any suggestion is much appreciated thanks.
CodePudding user response:
library(dplyr)
dat %>%
group_by(ID, Date) %>%
filter(n_distinct(Source) == 1L | Source == "A") %>%
group_by(ID, Date, Source) %>%
summarize(Score = max(Score), .groups = "drop")
# # A tibble: 6 x 4
# ID Date Source Score
# <int> <chr> <chr> <dbl>
# 1 1 2016-02-24 A 19.2
# 2 2 2020-01-08 B 16.6
# 3 3 2021-01-25 A 18.1
# 4 4 2011-02-28 A 17.4
# 5 5 2011-02-28 A 19.2
# 6 6 2016-04-16 C 12.4
Data
dat <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), Date = c("2016-02-24", "2020-01-08", "2021-01-25", "2021-01-25", "2011-02-28", "2011-02-28", "2011-02-28", "2011-02-28", "2016-04-16", "2016-04-16"), Score = c(19.2, 16.6, 18.1, 16.2, 13.2, 17.4, 19.2, 14.6, 11.2, 12.4), Source = c("A", "B", "A", "C", "A", "A", "A", "C", "C", "C")), class = "data.frame", row.names = c(NA, -10L))