Home > Blockchain >  How to remove duplicate records based off a date/time timestamp column using tidyverse? R
How to remove duplicate records based off a date/time timestamp column using tidyverse? R

Time:12-08

I have a dataset that looks like this.

library(tidyverse)

data <- structure(list(Timestamp = c("12/7/2021 11:29:15", "12/7/2021 11:29:35", 
                                     "12/7/2021 11:59:18", "12/7/2021 12:28:53", 
                                     "12/7/2021 13:19:12"), 
                        ID = c(157767, 380925, 319956, 375889, 375889), 
                        Status = c("Yes", "No", "Yes", "No", "Yes")), 
                        row.names = c(NA, -5L), class = c("tbl_df",                                                                                             
                        "tbl", "data.frame"))

print(data)
> data
# A tibble: 5 x 3
  Timestamp              ID Status
  <chr>               <dbl> <chr> 
1 12/7/2021 11:29:15 157767 Yes   
2 12/7/2021 11:29:35 380925 No    
3 12/7/2021 11:59:18 319956 Yes   
4 12/7/2021 12:28:53 375889 No    
5 12/7/2021 13:19:12 375889 Yes  

Rows 4 & 5 have the same ID. Based off the TimeStamp, I can see that row 5 was done later and has and updated status.

Is it possible for me to filter based off timestamp to only show the most recent timestamps in my data frame? my Desired output would be the following

  Timestamp              ID Status
  <chr>               <dbl> <chr> 
1 12/7/2021 11:29:15 157767 Yes   
2 12/7/2021 11:29:35 380925 No    
3 12/7/2021 11:59:18 319956 Yes      
4 12/7/2021 13:19:12 375889 Yes 

Thank you.

CodePudding user response:

With dplyr::slice_max you could do:

Note: As your Timestamp is a character we first have to convert to a datetime to make this work (Thanks to @utubun for pointing that out in his comment)

library(dplyr)

data %>% 
  group_by(ID) %>% 
  slice_max(as.POSIXct(Timestamp, format = "%d/%m/%Y %H:%M:%S")) %>% 
  ungroup() %>% 
  arrange(Timestamp)
#> # A tibble: 4 × 3
#>   Timestamp              ID Status
#>   <chr>               <dbl> <chr> 
#> 1 12/7/2021 11:29:15 157767 Yes   
#> 2 12/7/2021 11:29:35 380925 No    
#> 3 12/7/2021 11:59:18 319956 Yes   
#> 4 12/7/2021 13:19:12 375889 Yes

CodePudding user response:

  1. First transform Timestamp to dttm class.
  2. group and filter by last()
library(dplyr)
library(lubridate)
data %>% 
  mutate(Timestamp = dmy_hms(Timestamp)) %>% 
  group_by(ID) %>% 
  filter(Timestamp == last(Timestamp))
  Timestamp               ID Status
  <dttm>               <dbl> <chr> 
1 2021-07-12 11:29:15 157767 Yes   
2 2021-07-12 11:29:35 380925 No    
3 2021-07-12 11:59:18 319956 Yes   
4 2021-07-12 13:19:12 375889 Yes 
``´
  • Related