Home > Blockchain >  Splitting data frame into segments based on a cutoff value in a column in R
Splitting data frame into segments based on a cutoff value in a column in R

Time:12-24

I am currently working with a dataframe that contains observations with a corresponding timestamp. Please see below for a subsample of the data set.

Now, I would like to split the df into smaller segments based on the time difference in the second column. However, I don't want to create multiple new dfs, but I would like to assign different "ids" to different segments.

I.e. let's say I have a cut off time of 0.4 days. Now, I want to go through the data frame and as soon as the time difference is bigger than 0.4 days, I would like that the ID from A.1, changes to A.2. The id then stays A.2 as long as the time difference is < 0.4 days. However, as soon as the time difference in the next row is >0.4 days the id should change to A.3, etc (please see desired output).

Subsample of dataset:

1  A.1 2019-02-04 22:33:41  0.0000000
2  A.1 2019-02-05 13:28:38  0.6214931
3  A.1 2019-02-05 13:28:38  0.0000000
4  A.1 2019-02-05 22:25:00  0.3724769
5  A.1 2019-02-06 16:13:35  0.7420718
6  A.1 2019-02-07 22:55:14  1.2789236
7  A.1 2019-02-07 22:55:14  0.0000000
8  A.1 2019-02-07 22:55:14  0.0000000
9  A.1 2019-02-08 12:42:42  0.5746296
10 A.1 2019-02-08 22:26:09  0.4051736
11 A.1 2019-02-08 22:26:09  0.0000000
12 A.1 2019-02-08 22:26:09  0.0000000
13 A.1 2019-02-08 22:26:09  0.0000000
14 A.1 2019-02-08 22:26:09  0.0000000
15 A.1 2019-02-12 11:23:57  3.5401389

Desired output:

1  A.1 2019-02-04 22:33:41  0.0000000
2  A.2 2019-02-05 13:28:38  0.6214931
3  A.2 2019-02-05 13:28:38  0.0000000
4  A.2 2019-02-05 22:25:00  0.3724769
5  A.3 2019-02-06 16:13:35  0.7420718
6  A.4 2019-02-07 22:55:14  1.2789236
7  A.4 2019-02-07 22:55:14  0.0000000
8  A.4 2019-02-07 22:55:14  0.0000000
9  A.5 2019-02-08 12:42:42  0.5746296
10 A.6 2019-02-08 22:26:09  0.4051736
11 A.6 2019-02-08 22:26:09  0.0000000
12 A.6 2019-02-08 22:26:09  0.0000000
13 A.6 2019-02-08 22:26:09  0.0000000
14 A.6 2019-02-08 22:26:09  0.0000000
15 A.7 2019-02-12 11:23:57  3.5401389

CodePudding user response:

In data.table:

dt[, V1 := paste0("A.", 1 cumsum(V4 >= 0.4))]

In dplyr:

df %>%
  mutate(V1 = paste0("A.", 1 cumsum(V4 >= 0.4)))

CodePudding user response:

Check if differences are larger than .4 times the number of seconds of a day and calculate cumsum.

paste0('A.', cumsum(c(0, diff(dat$X3)) > (24*60*60)*.4)   1)
# [1] "A.1" "A.2" "A.2" "A.2" "A.3" "A.4" "A.4" "A.4" "A.5" "A.6" "A.6" "A.6"
# [13] "A.6" "A.6" "A.7"

Note: If you have the time as character format, first do

dat$X3 <- as.POSIXct(dat$X3)

to get "POSIXt" format, which is stored in seconds.


Data:

dat <- structure(list(X1 = 1:15, X2 = c("A.1", "A.1", "A.1", "A.1", 
"A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1", "A.1", 
"A.1", "A.1"), X3 = structure(c(1549316021, 1549369718, 1549369718, 
1549401900, 1549466015, 1549576514, 1549576514, 1549576514, 1549626162, 
1549661169, 1549661169, 1549661169, 1549661169, 1549661169, 1549967037
), class = c("POSIXct", "POSIXt"), tzone = ""), X4 = c(0, 0.6214931, 
0, 0.3724769, 0.7420718, 1.2789236, 0, 0, 0.5746296, 0.4051736, 
0, 0, 0, 0, 3.5401389)), row.names = c(NA, -15L), class = "data.frame")
  • Related