Get the latest time within the same date by group in R-CodePudding

I have a dataset

structure(list(country = c("US", "US", "US", "US", "US", "US",
                                        "UK", "UK"), ID = c(11, 11, 11, 11, 11, 11,
                                                                    12, 12), entry = c("2017-04-25 09:01:25", "2017-04-21 09:00:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46",
                                                                                                    "2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
                                                                    ), out = c("2017-04-21 09:41:00", "2017-04-21 09:41:00",
                                                                                           "2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:22",
                                                                                           "2016-07-29 11:36:12", NA, NA
                                                                    )), class = "data.frame", row.names = c(NA, -8L))

And the output is something like this

     country ID               entry                 out
1         US 11 2017-04-25 09:01:25 2017-04-21 09:41:00
2         US 11 2017-04-21 09:00:24 2017-04-21 09:41:00
3         US 11 2017-02-19 08:09:57 2017-04-21 09:41:00
4         US 11 2017-01-24 16:55:56 2017-04-21 09:41:00
5         US 11 2016-06-27 14:00:46 2016-07-29 11:36:22
6         US 11 2016-06-27 13:55:28 2016-07-29 11:36:12
7         UK 12 2016-08-30 10:35:29                <NA>
8         UK 12 2016-06-28 12:35:39                <NA>

What I would like to do is to remove duplicate dates for the same group of IDs and if there's a same date but with a different h:m:s value, then take the latest date of it, so it would look something like this:

  country    ID entry               out                 unique_out     
   <chr>   <dbl> <chr>               <chr>               <chr>              
 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA                 
 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA                 
 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA                 
 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
 7 UK         12 2016-08-30 10:35:29 NA                  NA                 
 8 UK         12 2016-06-28 12:35:39 NA                  NA

I have used this data %>% group_by(ID) %>% mutate(unique_out=replace(out, duplicated(out), NA)) to filter out the duplicates but cannot figure out how to keep the latest time of the same day within a group of ID. Many thanks if you could help me with this!

CodePudding user response：

You can use data.table::rleid:

library(data.table)
setDT(dat)[order(out),unique_out:=last(out), by=.(ID,rleid(as.Date(out)))]

Output:

   country ID               entry                 out          unique_out
1:      US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
2:      US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 09:41:00
3:      US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 09:41:00
4:      US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 09:41:00
5:      US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
6:      US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 11:36:22
7:      UK 12 2016-08-30 10:35:29                <NA>                <NA>
8:      UK 12 2016-06-28 12:35:39                <NA>                <NA>

If you want the additional NA values, you can add this:

dat[rowidv(dat, c("ID","unique_out"))!=1, unique_out:=NA]

CodePudding user response：

Another possible solution, based on dplyr and on lubridate:

library(dplyr)
library(lubridate)

df %>% 
  group_by(ID, aux = date(ymd_hms(out))) %>% 
  mutate(unique_out = if_else(n() > 1 & row_number() == 1, max(ymd_hms(out)), 
    NA_POSIXct_)) %>% ungroup %>% select(-aux)

#> # A tibble: 8 × 5
#>   country    ID entry               out                 unique_out         
#>   <chr>   <dbl> <chr>               <chr>               <dttm>             
#> 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
#> 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA                 
#> 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA                 
#> 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA                 
#> 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
#> 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA                 
#> 7 UK         12 2016-08-30 10:35:29 <NA>                NA                 
#> 8 UK         12 2016-06-28 12:35:39 <NA>                NA

CodePudding user response：

Try this:

dat %>%
  mutate(
    across(c(entry, out), as.POSIXct),
    out_date = as.Date(out)
  ) %>%
  group_by(ID, out_date) %>%
  mutate(unique_out = if_else(row_number() %in% which.max(out), out, out[NA])) %>%
  ungroup()
# # A tibble: 8 x 6
#   country    ID entry               out                 out_date   unique_out         
#   <chr>   <dbl> <dttm>              <dttm>              <date>     <dttm>             
# 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
# 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA                 
# 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA                 
# 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA                 
# 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
# 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA                 
# 7 UK         12 2016-08-30 10:35:29 NA                  NA         NA                 
# 8 UK         12 2016-06-28 12:35:39 NA                  NA         NA

Walk-through:

in order to use which.max later, we need number-like timestamps, so we first convert to POSIXt timestamps; this step can be omitted if you need to keep your timestamps as strings, though you'll need a few more steps (starting with out == max(out) instead of which.max, but then you need to de-duplicate it);
in order to get the latest time per day, it will help to group on the date (as well as ID), so we generate a date-only variant of out;
group by both date and ID, find which is the first of the max(out) by way of which.max

CodePudding user response：

How about grouping by each day and choosing the first max value?

library(dplyr)

test |>
  group_by(ID, as.Date(out)) |>
  mutate(unique_out = ifelse(out == max(out) & row_number() == 1, out, NA)) |>
  ungroup()

Output:

# A tibble: 8 × 6
  country    ID entry               out                 `as.Date(out)` unique_out         
  <chr>   <dbl> <chr>               <chr>               <date>         <chr>              
1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21     2017-04-21 09:41:00
2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21     NA                 
3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21     NA                 
4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21     NA                 
5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29     2016-07-29 11:36:22
6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29     NA                 
7 UK         12 2016-08-30 10:35:29 NA                  NA             NA                 
8 UK         12 2016-06-28 12:35:39 NA                  NA             NA

Update: Very similar approach to r2evans's (who was faster); keeping it for inspiration.

CodePudding user response：

Using by, generate sub IDs using seq in ave, next assign which.max into u and replace non-matches between s and u with NA. Note, that you need to do u <- c('entry', 'out'); dat[u] <- lapply(dat[u], as.POSIXct) beforehand to work with an appropriate time i.e. "POSIXct" format.

by(dat, dat$country, \(x) { 
  s <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=seq_along)
  u <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=which.max)
  cbind(x, unique_out=replace(x$out, u != s, NA))
}) |>
  unsplit(dat$country)
#   country ID               entry                 out          unique_out
# 1      US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
# 2      US 11 2017-04-21 09:00:24 2017-04-21 09:41:00                <NA>
# 3      US 11 2017-02-19 08:09:57 2017-04-21 09:41:00                <NA>
# 4      US 11 2017-01-24 16:55:56 2017-04-21 09:41:00                <NA>
# 5      US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
# 6      US 11 2016-06-27 13:55:28 2016-07-29 11:36:12                <NA>
# 7      UK 12 2016-08-30 10:35:29                <NA>                <NA>
# 8      UK 12 2016-06-28 12:35:39                <NA>                <NA>`