Home > Back-end >  R: Remove duplicates row based on certain criteria
R: Remove duplicates row based on certain criteria

Time:04-27

I want to remove duplicate based on certain criteria. My data look like:

Animal<-c("bird","Bird ","Dog","Cat F","Lion","Lion","Lion","dog","Horse","cat", "Lion")

A_date<-c("02-08-2020","20-06-2018","01-01-2015","10-07-2021","20-06-2018","15-08-2019","05-08-2013","20-06-2010","15-11-2016","22-03-2022","15-05-2019")

ID<-c("T1", "T1","T1","T2","T2","T3","T3","T4","T4","T5","T5")

Mydata<-data.frame(Animal, A_date,col_1)

 Animal   A_date       ID
bird     02-08-2020    T1
Bird     20-06-2018    T1
Dog      01-01-2015    T1
Cat F    10-07-2021    T2
Lion     20-06-2018    T2
Lion     15-08-2019    T3
lion     05-08-2013    T3
dog      20-06-2010    T4
Horse    15-11-2016    T4
cat      22-03-2022    T5
Lion     15-05-2019    T5

I want to remove duplicate row so that only the row with latest date pr. ID will remain. For example in the above table Lion appears 3 times with the same ID. So I want to remain with only Lion 15-08-2019 T3 but I want to keep Lion that has ID T5.

The end result should be like the following:

    Animal   A_date       ID
    Dog      01-01-2015    T1
    bird     02-08-2020    T1
    Dog      01-01-2015    T1
    Cat F    10-07-2021    T2
    Lion     15-08-2019    T3
    dog      20-06-2010    T4
    Horse    15-11-2016    T4
    cat      22-03-2022    T5
    Lion     15-05-2019    T5

The data that I working on is very large and has ID from T1 to T20. I have so fare tried the following code. But it does not work properly

library(lubridate)
library(dplyr)

Mydata <- Mydata %>%
  mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
  arrange(A_date)
Mydata %>%
  filter(!duplicated(Animal, fromLast = TRUE))

the result that I get

Animal A_date ID
DOG   <NA>    T1
HORSE <NA>    T4
BIRD  <NA>    T1
LION  <NA>    T3
BIRD  <NA>    T1
CAT F <NA>    T2
CAT   <NA>    T5

This is not the end result that i want.

CodePudding user response:

One option is to group by ID and Animal, then arrange so that for each group the most recent date is at the top for that group (i.e., the latest date), then slice that row.

library(lubridate)
library(dplyr)

Mydata %>%
  mutate(Animal = trimws(toupper(Animal)), A_date = lubridate::dmy(A_date)) %>%
  group_by(ID, Animal) %>%
  arrange(ID, Animal, desc(A_date)) %>%
  slice(1)

Output

  Animal A_date     ID   
  <chr>  <date>     <chr>
1 BIRD   2020-08-02 T1   
2 DOG    2015-01-01 T1   
3 CAT F  2021-07-10 T2   
4 LION   2018-06-20 T2   
5 LION   2019-08-15 T3   
6 DOG    2010-06-20 T4   
7 HORSE  2016-11-15 T4   
8 CAT    2022-03-22 T5   
9 LION   2019-05-15 T5  

CodePudding user response:

We can try slice_max over A_date

Mydata %>%
    mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
    group_by(ID, Animal) %>%
    slice_max(A_date) %>%
    ungroup()

which gives

# A tibble: 10 x 3
   Animal  A_date     ID   
   <chr>   <date>     <chr>
 1 "BIRD"  2020-08-02 T1
 2 "BIRD " 2018-06-20 T1
 3 "DOG"   2015-01-01 T1
 4 "CAT F" 2021-07-10 T2
 5 "LION"  2018-06-20 T2
 6 "LION"  2019-08-15 T3
 7 "DOG"   2010-06-20 T4
 8 "HORSE" 2016-11-15 T4
 9 "CAT"   2022-03-22 T5
10 "LION"  2019-05-15 T5
  • Related