Generate id for durations with dplyr-CodePudding

I have a column with dates (Time), I consider one duration as consecutive times of 1s:

data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))

I would like to generate an Id for each duration like this:

data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))

With dplyr... Thank you

CodePudding user response：

Up front:

cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2

First, you should really be working with real timestamps and not strings. If you're doing anything else with your Time field, it is almost certainly going to be a number-like operation, so you should do this up-front with

data$Time <- as.POSIXct(data$Time)

This works easily here because they are well-formed along the default format of "%Y-%m-%d %H:%M:%S"; see ?strptime for the %-codes.

From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:

as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1

Really, the key operator is diff, but it can report minutes or hours or such if the data is widely-enough spaced; there's an internal heuristic for that. Wrapping it in as.numeric(., units="secs") forces it to always be in seconds.

From here, we need a cumulative sum of when it is above 1, ergo > 1L, so cumsum(. > 1L).

Note that we have input length 5 but output length 4, this makes sense realizing that differences are between two elements. We force the first difference-test to be TRUE. If you have since changed to POSIXt-class, then the original code is reduced slightly to be

cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))

and therefore to store it as Id,

data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))