Home > other >  Remove duplicates based on multiple conditions
Remove duplicates based on multiple conditions

Time:12-16

I have some individuals that are listed twice because they have received numerous degrees. I am trying to only get the rows with the latest degree granting date. Below are examples of the current output and the desired output

people | g_date   | wage|quarter 
personA|2009-01-01|100  |20201
personA|2009-01-01|100  |20202
personA|2010-01-01|100  |20201
personA|2010-01-01|100  |20202
personB|2012-01-01|50   |20201
personB|2012-01-01|50   |20202
personB|2012-01-01|50   |20203

Desired output
people | g_date   | wage|quarter 
personA|2010-01-01|100  |20201
personA|2010-01-01|100  |20202
personB|2012-01-01|50   |20201
personB|2012-01-01|50   |20202
personB|2012-01-01|50   |20203

I have used the code that is below but it is removing all the rows so that there is only one row per person.

df<-df[order(df$g_date),]
df<-df[!duplicated(df$people, fromLast = TRUE),]

CodePudding user response:

Another option using group_by with ordered slice_max like this:

library(dplyr)
df %>%
  group_by(people, quarter) %>%
  slice_max(order_by = g_date, n = 1)
#> # A tibble: 5 × 4
#> # Groups:   people, quarter [5]
#>   people  g_date      wage quarter
#>   <chr>   <chr>      <dbl>   <int>
#> 1 personA 2010-01-01   100   20201
#> 2 personA 2010-01-01   100   20202
#> 3 personB 2012-01-01    50   20201
#> 4 personB 2012-01-01    50   20202
#> 5 personB 2012-01-01    50   20203

Created on 2022-12-15 with reprex v2.0.2

CodePudding user response:


merge(df, aggregate(. ~ people, df[1:2], max))

#>    people     g_date wage quarter
#> 1 personA 2010-01-01  100   20201
#> 2 personA 2010-01-01  100   20202
#> 3 personB 2012-01-01   50   20201
#> 4 personB 2012-01-01   50   20202
#> 5 personB 2012-01-01   50   20203

CodePudding user response:

Update (thanks to @Villalba, removed first answer):

We colud first group arrange and then filter:

library(dplyr)
library(lubridate)

df %>% 
  group_by(people, quarter) %>% 
  mutate(g_date = ymd(g_date)) %>% 
  arrange(g_date, .by_group = TRUE) %>% 
  filter(row_number()==n())
  people  g_date      wage quarter
  <chr>   <date>     <int>   <int>
1 personA 2010-01-01   100   20201
2 personA 2010-01-01   100   20202
3 personB 2012-01-01    50   20201
4 personB 2012-01-01    50   20202
5 personB 2012-01-01    50   20203
  •  Tags:  
  • r
  • Related