I am trying to group observations together based on how closely to each other they occured in time.
data <- data.frame(date = c("2020-04-14 03:26:58", "2020-04-14 11:26:58", "2020-04-14 12:29:20", "2020-04-14 12:48:02",
"2020-04-15 13:01:09", "2020-04-15 13:16:21", "2020-04-15 13:51:06", "2020-04-16 13:59:11",
"2020-04-16 14:01:37", "2020-04-18 20:02:37", "2020-04-18 20:17:37"))
data$date <- as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S")
head(data, 11)
date
1 2020-04-14 03:26:58
2 2020-04-14 11:26:58
3 2020-04-14 12:29:20
4 2020-04-14 12:48:02
5 2020-04-15 13:01:09
6 2020-04-15 13:16:21
7 2020-04-15 13:51:06
8 2020-04-16 13:59:11
9 2020-04-16 14:01:37
10 2020-04-18 20:02:37
11 2020-04-18 20:17:37
I want to assign the observations into discrete groups based on whether they occured within the same time period:
The rule could for be for example: Group rows together if the time difference between a row and its lagged row is less than 2 hours.
I tried creating a lagged variable and calculate the time difference between each row and its lagged row, but I can't figure out how to get from this to adding the groups.
# Create lagged date variable
data$lag <- lag(data$date)
# Calculate time difference between original and lagged variable
data$time_diff <- as.numeric(difftime(data$date, data$lag, unit = "hours"))
In this case, the desired output would add a group
column to the data
such as:
date lag time_diff group
1 2020-04-14 03:26:58 <NA> NA A
2 2020-04-14 11:26:58 2020-04-14 03:26:58 8.00000000 B
3 2020-04-14 12:29:20 2020-04-14 11:26:58 1.03944444 B
4 2020-04-14 12:48:02 2020-04-14 12:29:20 0.31166667 B
5 2020-04-15 13:01:09 2020-04-14 12:48:02 24.21861111 C
6 2020-04-15 13:16:21 2020-04-15 13:01:09 0.25333333 C
7 2020-04-15 13:51:06 2020-04-15 13:16:21 0.57916667 C
8 2020-04-16 13:59:11 2020-04-15 13:51:06 24.13472222 D
9 2020-04-16 14:01:37 2020-04-16 13:59:11 0.04055556 D
10 2020-04-18 20:02:37 2020-04-16 14:01:37 54.01666667 E
11 2020-04-18 20:17:37 2020-04-18 20:02:37 0.25000000 E
CodePudding user response:
One data.table
option:
setDT(data)
data[, diff_hours := as.numeric(difftime(date, shift(date, fill = date[1])), unit = "hours")]
data[, group := LETTERS[cumsum(diff_hours >= 2) 1L]]
# date diff_hours group
# 1: 2020-04-14 03:26:58 0.00000000 A
# 2: 2020-04-14 11:26:58 8.00000000 B
# 3: 2020-04-14 12:29:20 1.03944444 B
# 4: 2020-04-14 12:48:02 0.31166667 B
# 5: 2020-04-15 13:01:09 24.21861111 C
# 6: 2020-04-15 13:16:21 0.25333333 C
# 7: 2020-04-15 13:51:06 0.57916667 C
# 8: 2020-04-16 13:59:11 24.13472222 D
# 9: 2020-04-16 14:01:37 0.04055556 D
# 10: 2020-04-18 20:02:37 54.01666667 E
# 11: 2020-04-18 20:17:37 0.25000000 E