Home > Enterprise >  How to group observations together based on (chronological) time differences in R
How to group observations together based on (chronological) time differences in R

Time:11-03

I am trying to group observations together based on how closely to each other they occured in time.

data <- data.frame(date = c("2020-04-14 03:26:58", "2020-04-14 11:26:58", "2020-04-14 12:29:20", "2020-04-14 12:48:02",
                            "2020-04-15 13:01:09", "2020-04-15 13:16:21", "2020-04-15 13:51:06", "2020-04-16 13:59:11",
                            "2020-04-16 14:01:37", "2020-04-18 20:02:37", "2020-04-18 20:17:37"))

data$date <- as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S")

head(data, 11)
                  date
1  2020-04-14 03:26:58
2  2020-04-14 11:26:58
3  2020-04-14 12:29:20
4  2020-04-14 12:48:02
5  2020-04-15 13:01:09
6  2020-04-15 13:16:21
7  2020-04-15 13:51:06
8  2020-04-16 13:59:11
9  2020-04-16 14:01:37
10 2020-04-18 20:02:37
11 2020-04-18 20:17:37

 

I want to assign the observations into discrete groups based on whether they occured within the same time period:

The rule could for be for example: Group rows together if the time difference between a row and its lagged row is less than 2 hours.

I tried creating a lagged variable and calculate the time difference between each row and its lagged row, but I can't figure out how to get from this to adding the groups.

# Create lagged date variable
data$lag <- lag(data$date)

# Calculate time difference between original and lagged variable
data$time_diff <- as.numeric(difftime(data$date, data$lag, unit = "hours"))

In this case, the desired output would add a group column to the data such as:

                  date                 lag   time_diff group
1  2020-04-14 03:26:58                <NA>          NA     A
2  2020-04-14 11:26:58 2020-04-14 03:26:58  8.00000000     B
3  2020-04-14 12:29:20 2020-04-14 11:26:58  1.03944444     B
4  2020-04-14 12:48:02 2020-04-14 12:29:20  0.31166667     B
5  2020-04-15 13:01:09 2020-04-14 12:48:02 24.21861111     C
6  2020-04-15 13:16:21 2020-04-15 13:01:09  0.25333333     C
7  2020-04-15 13:51:06 2020-04-15 13:16:21  0.57916667     C
8  2020-04-16 13:59:11 2020-04-15 13:51:06 24.13472222     D
9  2020-04-16 14:01:37 2020-04-16 13:59:11  0.04055556     D
10 2020-04-18 20:02:37 2020-04-16 14:01:37 54.01666667     E
11 2020-04-18 20:17:37 2020-04-18 20:02:37  0.25000000     E

CodePudding user response:

One data.table option:

setDT(data)
data[, diff_hours := as.numeric(difftime(date, shift(date, fill = date[1])), unit = "hours")]
data[, group := LETTERS[cumsum(diff_hours >= 2)   1L]]
#                    date  diff_hours group
#  1: 2020-04-14 03:26:58  0.00000000     A
#  2: 2020-04-14 11:26:58  8.00000000     B
#  3: 2020-04-14 12:29:20  1.03944444     B
#  4: 2020-04-14 12:48:02  0.31166667     B
#  5: 2020-04-15 13:01:09 24.21861111     C
#  6: 2020-04-15 13:16:21  0.25333333     C
#  7: 2020-04-15 13:51:06  0.57916667     C
#  8: 2020-04-16 13:59:11 24.13472222     D
#  9: 2020-04-16 14:01:37  0.04055556     D
# 10: 2020-04-18 20:02:37 54.01666667     E
# 11: 2020-04-18 20:17:37  0.25000000     E
  • Related