I have a dataset
structure(list(country = c("US", "US", "US", "US", "US", "US",
"UK", "UK"), ID = c(11, 11, 11, 11, 11, 11,
12, 12), entry = c("2017-04-25 09:01:25", "2017-04-21 09:00:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46",
"2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
), out = c("2017-04-21 09:41:00", "2017-04-21 09:41:00",
"2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:22",
"2016-07-29 11:36:12", NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
And the output is something like this
country ID entry out
1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00
2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00
3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00
4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00
5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22
6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12
7 UK 12 2016-08-30 10:35:29 <NA>
8 UK 12 2016-06-28 12:35:39 <NA>
What I would like to do is to remove duplicate dates for the same group of IDs and if there's a same date but with a different h:m:s value, then take the latest date of it, so it would look something like this:
country ID entry out unique_out
<chr> <dbl> <chr> <chr> <chr>
1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA
3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA
4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA
5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
7 UK 12 2016-08-30 10:35:29 NA NA
8 UK 12 2016-06-28 12:35:39 NA NA
I have used this data %>% group_by(ID) %>% mutate(unique_out=replace(out, duplicated(out), NA))
to filter out the duplicates but cannot figure out how to keep the latest time of the same day within a group of ID.
Many thanks if you could help me with this!
CodePudding user response:
You can use data.table::rleid
:
library(data.table)
setDT(dat)[order(out),unique_out:=last(out), by=.(ID,rleid(as.Date(out)))]
Output:
country ID entry out unique_out
1: US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
2: US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 09:41:00
3: US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 09:41:00
4: US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 09:41:00
5: US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
6: US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 11:36:22
7: UK 12 2016-08-30 10:35:29 <NA> <NA>
8: UK 12 2016-06-28 12:35:39 <NA> <NA>
If you want the additional NA
values, you can add this:
dat[rowidv(dat, c("ID","unique_out"))!=1, unique_out:=NA]
CodePudding user response:
Another possible solution, based on dplyr
and on lubridate
:
library(dplyr)
library(lubridate)
df %>%
group_by(ID, aux = date(ymd_hms(out))) %>%
mutate(unique_out = if_else(n() > 1 & row_number() == 1, max(ymd_hms(out)),
NA_POSIXct_)) %>% ungroup %>% select(-aux)
#> # A tibble: 8 × 5
#> country ID entry out unique_out
#> <chr> <dbl> <chr> <chr> <dttm>
#> 1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
#> 2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA
#> 3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA
#> 4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA
#> 5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
#> 6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
#> 7 UK 12 2016-08-30 10:35:29 <NA> NA
#> 8 UK 12 2016-06-28 12:35:39 <NA> NA
CodePudding user response:
Try this:
dat %>%
mutate(
across(c(entry, out), as.POSIXct),
out_date = as.Date(out)
) %>%
group_by(ID, out_date) %>%
mutate(unique_out = if_else(row_number() %in% which.max(out), out, out[NA])) %>%
ungroup()
# # A tibble: 8 x 6
# country ID entry out out_date unique_out
# <chr> <dbl> <dttm> <dttm> <date> <dttm>
# 1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
# 2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA
# 3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA
# 4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA
# 5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
# 6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA
# 7 UK 12 2016-08-30 10:35:29 NA NA NA
# 8 UK 12 2016-06-28 12:35:39 NA NA NA
Walk-through:
- in order to use
which.max
later, we need number-like timestamps, so we first convert toPOSIXt
timestamps; this step can be omitted if you need to keep your timestamps as strings, though you'll need a few more steps (starting without == max(out)
instead ofwhich.max
, but then you need to de-duplicate it); - in order to get the latest time per day, it will help to group on the date (as well as
ID
), so we generate a date-only variant ofout
; - group by both date and ID, find which is the first of the
max(out)
by way ofwhich.max
CodePudding user response:
How about grouping by each day and choosing the first max value?
library(dplyr)
test |>
group_by(ID, as.Date(out)) |>
mutate(unique_out = ifelse(out == max(out) & row_number() == 1, out, NA)) |>
ungroup()
Output:
# A tibble: 8 × 6
country ID entry out `as.Date(out)` unique_out
<chr> <dbl> <chr> <chr> <date> <chr>
1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA
3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA
4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA
5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA
7 UK 12 2016-08-30 10:35:29 NA NA NA
8 UK 12 2016-06-28 12:35:39 NA NA NA
Update: Very similar approach to r2evans's (who was faster); keeping it for inspiration.
CodePudding user response:
Using by
, generate s
ub IDs using seq
in ave
, next assign which.max
into u
and replace
non-matches between s
and u
with NA
. Note, that you need to do u <- c('entry', 'out'); dat[u] <- lapply(dat[u], as.POSIXct)
beforehand to work with an appropriate time i.e. "POSIXct"
format.
by(dat, dat$country, \(x) {
s <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=seq_along)
u <- ave(as.numeric(x$out), strftime(x$out, '%H:%M'), FUN=which.max)
cbind(x, unique_out=replace(x$out, u != s, NA))
}) |>
unsplit(dat$country)
# country ID entry out unique_out
# 1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
# 2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 <NA>
# 3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 <NA>
# 4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 <NA>
# 5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
# 6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 <NA>
# 7 UK 12 2016-08-30 10:35:29 <NA> <NA>
# 8 UK 12 2016-06-28 12:35:39 <NA> <NA>`