Home > Enterprise >  Get the latest time within the same date by group in R
Get the latest time within the same date by group in R

Time:07-12

I have a dataset

structure(list(country = c("US", "US", "US", "US", "US", "US",
                                        "UK", "UK"), ID = c(11, 11, 11, 11, 11, 11,
                                                                    12, 12), entry = c("2017-04-25 09:01:25", "2017-04-21 09:00:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46",
                                                                                                    "2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
                                                                    ), out = c("2017-04-21 09:41:00", "2017-04-21 09:41:00",
                                                                                           "2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:22",
                                                                                           "2016-07-29 11:36:12", NA, NA
                                                                    )), class = "data.frame", row.names = c(NA, -8L))

And the output is something like this

     country ID               entry                 out
1         US 11 2017-04-25 09:01:25 2017-04-21 09:41:00
2         US 11 2017-04-21 09:00:24 2017-04-21 09:41:00
3         US 11 2017-02-19 08:09:57 2017-04-21 09:41:00
4         US 11 2017-01-24 16:55:56 2017-04-21 09:41:00
5         US 11 2016-06-27 14:00:46 2016-07-29 11:36:22
6         US 11 2016-06-27 13:55:28 2016-07-29 11:36:12
7         UK 12 2016-08-30 10:35:29                <NA>
8         UK 12 2016-06-28 12:35:39                <NA>

What I would like to do is to remove duplicate dates for the same group of IDs and if there's a same date but with a different h:m:s value, then take the latest date of it, so it would look something like this:

  country    ID entry               out                 unique_out     
   <chr>   <dbl> <chr>               <chr>               <chr>              
 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA                 
 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA                 
 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA                 
 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
 7 UK         12 2016-08-30 10:35:29 NA                  NA                 
 8 UK         12 2016-06-28 12:35:39 NA                  NA 

I have used this data %>% group_by(ID) %>% mutate(unique_out=replace(out, duplicated(out), NA)) to filter out the duplicates but cannot figure out how to keep the latest time of the same day within a group of ID. Many thanks if you could help me with this!

CodePudding user response:

You can use data.table::rleid:

library(data.table)
setDT(dat)[order(out),unique_out:=last(out), by=.(ID,rleid(as.Date(out)))]

Output:

   country ID               entry                 out          unique_out
1:      US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
2:      US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 09:41:00
3:      US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 09:41:00
4:      US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 09:41:00
5:      US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
6:      US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 11:36:22
7:      UK 12 2016-08-30 10:35:29                <NA>                <NA>
8:      UK 12 2016-06-28 12:35:39                <NA>                <NA>

If you want the additional NA values, you can add this:

dat[rowidv(dat, c("ID","unique_out"))!=1, unique_out:=NA]

CodePudding user response:

Another possible solution, based on dplyr and on lubridate:

library(dplyr)
library(lubridate)

df %>% 
  group_by(ID, aux = date(ymd_hms(out))) %>% 
  mutate(unique_out = if_else(n() > 1 & row_number() == 1, max(ymd_hms(out)), 
    NA_POSIXct_)) %>% ungroup %>% select(-aux)

#> # A tibble: 8 × 5
#>   country    ID entry               out                 unique_out         
#>   <chr>   <dbl> <chr>               <chr>               <dttm>             
#> 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
#> 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA                 
#> 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA                 
#> 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA                 
#> 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
#> 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA                 
#> 7 UK         12 2016-08-30 10:35:29 <NA>                NA                 
#> 8 UK         12 2016-06-28 12:35:39 <NA>                NA

CodePudding user response:

Try this:

dat %>%
  mutate(
    across(c(entry, out), as.POSIXct),
    out_date = as.Date(out)
  ) %>%
  group_by(ID, out_date) %>%
  mutate(unique_out = if_else(row_number() %in% which.max(out), out, out[NA])) %>%
  ungroup()
# # A tibble: 8 x 6
#   country    ID entry               out                 out_date   unique_out         
#   <chr>   <dbl> <dttm>              <dttm>              <date>     <dttm>             
# 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
# 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA                 
# 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA                 
# 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA                 
# 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
# 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA                 
# 7 UK         12 2016-08-30 10:35:29 NA                  NA         NA                 
# 8 UK         12 2016-06-28 12:35:39 NA                  NA         NA                 

Walk-through:

  • in order to use which.max later, we need number-like timestamps, so we first convert to POSIXt timestamps; this step can be omitted if you need to keep your timestamps as strings, though you'll need a few more steps (starting with out == max(out) instead of which.max, but then you need to de-duplicate it);
  • in order to get the latest time per day, it will help to group on the date (as well as ID), so we generate a date-only variant of out;
  • group by both date and ID, find which is the first of the max(out) by way of which.max

CodePudding user response:

How about grouping by each day and choosing the first max value?

library(dplyr)

test |>
  group_by(ID, as.Date(out)) |>
  mutate(unique_out = ifelse(out == max(out) & row_number() == 1, out, NA)) |>
  ungroup()

Output:

# A tibble: 8 × 6
  country    ID entry               out                 `as.Date(out)` unique_out         
  <chr>   <dbl> <chr>               <chr>               <date>         <chr>              
1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21     2017-04-21 09:41:00
2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21     NA                 
3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21     NA                 
4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21     NA                 
5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29     2016-07-29 11:36:22
6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29     NA                 
7 UK         12 2016-08-30 10:35:29 NA                  NA             NA                 
8 UK         12 2016-06-28 12:35:39 NA                  NA             NA                 

Update: Very similar approach to r2evans's (who was faster); keeping it for inspiration.

CodePudding user response:

Using by, generate sub IDs using seq in ave, next assign which.max into u and replace non-matches between s and u with NA. Note, that you need to do u <- c('entry', 'out'); dat[u] <- lapply(dat[u], as.POSIXct) beforehand to work with an appropriate time i.e. "POSIXct" format.

by(dat, dat$country, \(x) { 
  s <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=seq_along)
  u <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=which.max)
  cbind(x, unique_out=replace(x$out, u != s, NA))
}) |>
  unsplit(dat$country)
#   country ID               entry                 out          unique_out
# 1      US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
# 2      US 11 2017-04-21 09:00:24 2017-04-21 09:41:00                <NA>
# 3      US 11 2017-02-19 08:09:57 2017-04-21 09:41:00                <NA>
# 4      US 11 2017-01-24 16:55:56 2017-04-21 09:41:00                <NA>
# 5      US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
# 6      US 11 2016-06-27 13:55:28 2016-07-29 11:36:12                <NA>
# 7      UK 12 2016-08-30 10:35:29                <NA>                <NA>
# 8      UK 12 2016-06-28 12:35:39                <NA>                <NA>`
  •  Tags:  
  • r
  • Related