Home > OS >  r filter by group multiple critera
r filter by group multiple critera

Time:02-21

I have a dataset with student scores. There are a lot of repeat rows per student ID like this below.

       ID    Date       Score    Source
       1     2016-02-24 19.2     A
       2     2020-01-08 16.6     B
       3     2021-01-25 18.1     A
       3     2021-01-25 16.2     C
       4     2011-02-28 13.2     A
       4     2011-02-28 17.4     A
       5     2011-02-28 19.2     A
       5     2011-02-28 14.6     C
       6     2016-04-16 11.2     C
       6     2016-04-16 12.4     C

My goal is to exclude some repeat observations and retain only observations based on this criteria.

Rule 1 : Same ID, Same date, different source. Retain only rows where Source = A Example ID 3, 5 , retain row

      3     2021-01-25 18.1     A
      5     2011-02-28 19.2     A

Rule 2 : Same ID, Same date, same source. Retain rows with max of score.

     4     2011-02-28 17.4     A
     6     2016-04-16 12.4     C

The final expected dataset

      ID    Date       Score    Source
       1     2016-02-24 19.2     A
       2     2020-01-08 16.6     B
  
       3     2021-01-25 18.1     A
      
       4     2011-02-28 17.4     A
  
       5     2011-02-28 19.2     A
        
       6     2016-04-16 12.4     C

I am aware of groupby and filter but I am not sure how to apply those functions in this situation. Any suggestion is much appreciated thanks.

CodePudding user response:

library(dplyr)
dat %>%
  group_by(ID, Date) %>%
  filter(n_distinct(Source) == 1L | Source == "A") %>%
  group_by(ID, Date, Source) %>%
  summarize(Score = max(Score), .groups = "drop")
# # A tibble: 6 x 4
#      ID Date       Source Score
#   <int> <chr>      <chr>  <dbl>
# 1     1 2016-02-24 A       19.2
# 2     2 2020-01-08 B       16.6
# 3     3 2021-01-25 A       18.1
# 4     4 2011-02-28 A       17.4
# 5     5 2011-02-28 A       19.2
# 6     6 2016-04-16 C       12.4

Data

dat <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), Date = c("2016-02-24", "2020-01-08", "2021-01-25", "2021-01-25", "2011-02-28", "2011-02-28", "2011-02-28", "2011-02-28", "2016-04-16", "2016-04-16"), Score = c(19.2, 16.6, 18.1, 16.2, 13.2, 17.4, 19.2, 14.6, 11.2, 12.4), Source = c("A", "B", "A", "C", "A", "A", "A", "C", "C", "C")), class = "data.frame", row.names = c(NA, -10L))
  • Related