So let's say I have some data like this:
ID value date
001 A 2015-12-06
001 A 2015-12-07
001 A 2015-12-08
002 B 2015-12-09
002 C 2015-12-10
003 A 2015-12-11
003 B 2015-12-12
002 B 2015-12-13
004 D 2015-12-13
004 R 2015-12-13
I want to find the value
that appears most frequently for each ID. But when there is a tie, take the most recent date's value.
Expected Output:
ID value
001 A
002 B
003 B
004 R
You may notice in the case of 004
there is the same date AND same ID during a tie. In this case, you can use the lower row entry.
CodePudding user response:
You can use the following code:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(n = n()) %>%
filter(date == max(date)) %>%
summarise(value = value[1])
#> # A tibble: 4 × 2
#> ID value
#> <int> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 B
#> 4 4 D
Created on 2022-07-02 by the reprex package (v2.0.1)
CodePudding user response:
Update, after clarification OP. See comments: Here is the version that gives the expected output:
library(dplyr)
df %>%
count(ID, value, date) %>%
group_by(ID) %>%
filter(date == max(date) & row_number() >1) %>%
dplyr::select(-n, -date)
ID value
<int> <chr>
1 1 A
2 2 B
3 3 B
4 4 R
First answer: Note group 0004 has ties and no most recent date, therefore both values are kept in the dataframe:
library(dplyr)
df %>%
count(ID, value, date) %>%
group_by(ID) %>%
filter(date == max(date)) %>%
dplyr::select(-n)
ID value date
<chr> <chr> <chr>
1 0001 A 2015-12-08
2 0002 B 2015-12-13
3 0003 B 2015-12-12
4 0004 D 2015-12-13
5 0004 R 2015-12-13