Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.
id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)
have<-as.data.frame(cbind(id,year,month))
have
id year month
1 2020 12
1 2020 6
1 2019 4
2 2020 5
2 2018 4
2 2018 1
This is what would like the dataset to look like:
want
id year month
1 2020 12
1 2019 4
2 2020 5
2 2018 4
I know that I can get the first instance of each id with this code, however I want the latest record for each year.
want<-have[match(unique(have$id), have$id),]
id year month
1 2020 12
2 2020 5
I modified the code to add in year, but it outputs the same results as the code above:
want<-have[match(unique(have$id,have$year), have$id),]
id year month
1 2020 12
2 2020 5
How would I modify this so I can see one record displayed per year?
CodePudding user response:
You can use dplyr::slice_min
like this:
library(dplyr)
have %>%
group_by(id,year) %>%
slice_min(order_by = month)
Output:
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
CodePudding user response:
We could group and then summarise with first()
library(dplyr)
have %>%
group_by(id, year) %>%
summarise(month = first(month))
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
CodePudding user response:
You can use the group_by in dplyr as follows:
have %>% group_by(year) %>% tally(max(month))