Home > Software design >  Find most frequent value but when there is a tie, choose the most recent in tidyverse
Find most frequent value but when there is a tie, choose the most recent in tidyverse

Time:07-03

So let's say I have some data like this:

ID  value  date
001     A  2015-12-06
001     A  2015-12-07
001     A  2015-12-08
002     B  2015-12-09
002     C  2015-12-10
003     A  2015-12-11
003     B  2015-12-12
002     B  2015-12-13
004     D  2015-12-13
004     R  2015-12-13

I want to find the value that appears most frequently for each ID. But when there is a tie, take the most recent date's value.

Expected Output:

ID  value
001     A
002     B
003     B
004     R

You may notice in the case of 004 there is the same date AND same ID during a tie. In this case, you can use the lower row entry.

CodePudding user response:

You can use the following code:

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(n = n()) %>%
  filter(date == max(date)) %>%
  summarise(value = value[1])
#> # A tibble: 4 × 2
#>      ID value
#>   <int> <chr>
#> 1     1 A    
#> 2     2 B    
#> 3     3 B    
#> 4     4 D

Created on 2022-07-02 by the reprex package (v2.0.1)

CodePudding user response:

Update, after clarification OP. See comments: Here is the version that gives the expected output:

library(dplyr)

df %>% 
  count(ID, value, date) %>% 
  group_by(ID) %>% 
  filter(date == max(date) & row_number() >1) %>% 
  dplyr::select(-n, -date)  
     ID value
  <int> <chr>
1     1 A    
2     2 B    
3     3 B    
4     4 R  

First answer: Note group 0004 has ties and no most recent date, therefore both values are kept in the dataframe:

library(dplyr)

df %>% 
  count(ID, value, date) %>% 
  group_by(ID) %>% 
  filter(date == max(date)) %>% 
  dplyr::select(-n)  
  ID    value date      
  <chr> <chr> <chr>     
1 0001  A     2015-12-08
2 0002  B     2015-12-13
3 0003  B     2015-12-12
4 0004  D     2015-12-13
5 0004  R     2015-12-13
  • Related