Home > front end >  Select first row for each id for each year
Select first row for each id for each year

Time:03-29

Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.

id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)

have<-as.data.frame(cbind(id,year,month))

have
  id year  month
  1  2020    12
  1  2020     6
  1  2019     4
  2  2020     5
  2  2018     4
  2  2018     1

This is what would like the dataset to look like:

want

  id year   month
  1  2020     12
  1  2019      4
  2  2020      5
  2  2018      4

I know that I can get the first instance of each id with this code, however I want the latest record for each year.

want<-have[match(unique(have$id), have$id),]

  id year  month
  1  2020    12
  2  2020     5

I modified the code to add in year, but it outputs the same results as the code above:

want<-have[match(unique(have$id,have$year), have$id),]

  id year  month
  1  2020    12
  2  2020     5

How would I modify this so I can see one record displayed per year?

CodePudding user response:

You can use dplyr::slice_min like this:

library(dplyr)

have %>%
  group_by(id,year) %>%
  slice_min(order_by = month)

Output:

     id  year month
  <dbl> <dbl> <dbl>
1     1  2019     4
2     1  2020    12
3     2  2018     4
4     2  2020     5

CodePudding user response:

We could group and then summarise with first()

library(dplyr)

have  %>% 
  group_by(id, year) %>% 
  summarise(month = first(month))
     id  year month
  <dbl> <dbl> <dbl>
1     1  2019     4
2     1  2020    12
3     2  2018     4
4     2  2020     5

CodePudding user response:

You can use the group_by in dplyr as follows:

have %>% group_by(year) %>% tally(max(month))

  •  Tags:  
  • r
  • Related