I have a column with dates (Time), I consider one duration as consecutive times of 1s:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))
I would like to generate an Id for each duration like this:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))
With dplyr... Thank you
CodePudding user response:
Up front:
cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2
First, you should really be working with real timestamps and not strings. If you're doing anything else with your Time
field, it is almost certainly going to be a number-like operation, so you should do this up-front with
data$Time <- as.POSIXct(data$Time)
This works easily here because they are well-formed along the default format of "%Y-%m-%d %H:%M:%S"
; see ?strptime
for the %
-codes.
From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:
as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1
Really, the key operator is diff
, but it can report minutes or hours or such if the data is widely-enough spaced; there's an internal heuristic for that. Wrapping it in as.numeric(., units="secs")
forces it to always be in seconds.
From here, we need a cumulative sum of when it is above 1, ergo > 1L
, so cumsum(. > 1L)
.
Note that we have input length 5 but output length 4, this makes sense realizing that differences are between two elements. We force the first difference-test to be TRUE
. If you have since changed to POSIXt
-class, then the original code is reduced slightly to be
cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
and therefore to store it as Id
,
data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))