I am currently working with a dataframe that contains observations with a corresponding timestamp. Please see below for a subsample of the data set.
Now, I would like to split the df into smaller segments based on the time difference in the second column. However, I don't want to create multiple new dfs, but I would like to assign different "ids" to different segments and do this for every factor in the first column.
I.e. let's say I have a cut off time of 0.4 days. Now, I want to go through the data frame and as soon as the time difference is bigger than 0.4 days, I would like that the ID from A.1, changes to A.2. The id then stays A.2 as long as the time difference is < 0.4 days. However, as soon as the time difference in the next row is >0.4 days the id should change to A.3, etc (please see desired output).
Subsample of dataset:
1 A.1 2019-02-04 22:33:41 0.0000000
2 A.1 2019-02-05 13:28:38 0.6214931
3 A.1 2019-02-05 13:28:38 0.0000000
4 A.1 2019-02-05 22:25:00 0.3724769
5 A.1 2019-02-06 16:13:35 0.7420718
6 A.1 2019-02-07 22:55:14 1.2789236
7 B.1 2019-02-07 22:55:14 0.0000000
8 B.1 2019-02-07 22:55:14 0.0000000
9 B.1 2019-02-08 12:42:42 0.5746296
10 B.1 2019-02-08 22:26:09 0.4051736
11 C.1 2019-02-08 22:26:09 0.0000000
12 C.1 2019-02-08 22:26:09 0.0000000
13 C.1 2019-02-08 22:26:09 0.0000000
14 C.1 2019-02-08 22:26:09 0.0000000
15 C.1 2019-02-12 11:23:57 3.5401389
Desired output:
1 A.1 2019-02-04 22:33:41 0.0000000
2 A.2 2019-02-05 13:28:38 0.6214931
3 A.2 2019-02-05 13:28:38 0.0000000
4 A.2 2019-02-05 22:25:00 0.3724769
5 A.3 2019-02-06 16:13:35 0.7420718
6 A.4 2019-02-07 22:55:14 1.2789236
7 B.1 2019-02-07 22:55:14 0.0000000
8 B.1 2019-02-07 22:55:14 0.0000000
9 B.2 2019-02-08 12:42:42 0.5746296
10 B.3 2019-02-08 22:26:09 0.4051736
11 C.1 2019-02-08 22:26:09 0.0000000
12 C.1 2019-02-08 22:26:09 0.0000000
13 C.1 2019-02-08 22:26:09 0.0000000
14 C.1 2019-02-08 22:26:09 0.0000000
15 C.2 2019-02-12 11:23:57 3.5401389
CodePudding user response:
In data.table
:
dt[, V1 := paste0("A.", 1 cumsum(V4 >= 0.4))]
In dplyr
:
df %>%
mutate(V1 = paste0("A.", 1 cumsum(V4 >= 0.4)))
CodePudding user response:
Check if diff
erences are larger than .4 times the number of seconds of a day and calculate cumsum
.
paste0('A.', cumsum(c(0, diff(dat$X3)) > (24*60*60)*.4) 1)
# [1] "A.1" "A.2" "A.2" "A.2" "A.3" "A.4" "A.4" "A.4" "A.5" "A.6" "A.6" "A.6"
# [13] "A.6" "A.6" "A.7"
Note: If you have the time as character format, first do
dat$X3 <- as.POSIXct(dat$X3)
to get "POSIXt"
format, which is stored in seconds.
Data:
dat <- structure(list(X1 = 1:15, X2 = c("A.1", "A.1", "A.1", "A.1",
"A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1",
"A.1", "A.1"), X3 = structure(c(1549316021, 1549369718, 1549369718,
1549401900, 1549466015, 1549576514, 1549576514, 1549576514, 1549626162,
1549661169, 1549661169, 1549661169, 1549661169, 1549661169, 1549967037
), class = c("POSIXct", "POSIXt"), tzone = ""), X4 = c(0, 0.6214931,
0, 0.3724769, 0.7420718, 1.2789236, 0, 0, 0.5746296, 0.4051736,
0, 0, 0, 0, 3.5401389)), row.names = c(NA, -15L), class = "data.frame")