I have a dataframe with 100000 rows and multiple variables/columns from which I would like to
- Calculate duration of a certain actions based on values in the column "Y". Column Y has multiple sequences of values 0 and 1 and whenever action takes place, there is values of 1. The idea would be to count a time difference from the first 1 in a sequence of ones (right after the last 0) until the final 1 in the sequence (right before next 0). For the every corresponding row of all the ones and zeros, there is always a timestamp in column "X" for the current runtime, so the time difference would basically be calculated from that with a simple substraction:
TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE
This same calculation would be repeated multiple times for all the different sequences of ones and a new dataframe listing all of the different durations for the action would be created.
- In a similar manner, for the values in the column "Z", calculate average, standard deviation, min and max from the period of first 1 of a sequence of ones until the final 1 of a sequence of ones for all of the different sequences. Then combine all the data together as one dataframe and export it as a csv-file, which should include variables for "action durations", "Z avg", "Z std", "Z min", "Z max" and the "id" column from the original dataframe. How could I write script like this in R?
The pseudo style code could probably look something like this:
for all the rows in df {
if (number 1 in column Y) {
from first 1 until the last 1 in a sequence: calculate TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE from column X
ALSO from the range of first value of 1 to the last value of 1 in this sequence of 1: calculate avg, std, min, and max for the variable Z
if (number 0) in column
add new element/row to the list (including the variables of: "action duration", "Z avg", "Z std", "Z min", "Z max" and the "id") and move to the next 1
(Not sure if the algorithm in the pseudo code is exactly what I was describing in the text, but at least I tried my best to include some kind of "code example" here as well :-))
CodePudding user response:
I believe you have multiple possible sequences of consecutive rows of ones and zeros.. I think the approach is to generate a unique identifier for each sequence and the estimate the statistics you want over each of these identifiers. This is easily done using data.table
, and data.table::rleied
library(data.table)
setDT(dt)
dt[,seqid:= rleid(Y)] %>%
.[Y==1,.(
seq_dur = as.numeric(max(time)-min(time)),
meanZ = mean(Z),
stdZ=sd(Z),
minZ = min(Z),
maxZ=max(Z)), by=.(seqid)]
Output:
seqid seq_dur meanZ stdZ minZ maxZ
1: 1 4 -0.41937718 0.7936389 -1.15956013 0.7945978
2: 3 5 -0.17031761 0.8429274 -1.41463319 0.8502819
3: 5 29 -0.01909116 1.1878013 -2.32238540 2.7392739
4: 7 17 -0.14040415 0.9600719 -1.82184504 1.1401493
5: 9 6 0.14154931 1.1930633 -1.55719089 1.6827525
---
2431: 4861 13 -0.17095911 0.9193558 -2.14869215 1.1571597
2432: 4863 27 -0.06239130 1.0546947 -2.46668844 2.2189060
2433: 4865 27 -0.22289381 1.0330064 -2.32818061 2.6114507
2434: 4867 2 0.42001740 0.8060201 -0.09206373 1.3491093
2435: 4869 3 -0.68025767 1.5678846 -2.96307855 0.5092229
Input:
set.seed(123)
dt = data.table(
Y = unlist(lapply(1:10000, \(x) rep(sample(c(1,0),1), times=sample(3:8,1))))
)
dt[, time := seq(as.POSIXct("2022/1/1"),by=1, length.out=nrow(dt))]
dt[, Z:=rnorm(nrow(dt))]
head(dt)
Y time Z
1: 1 2022-01-01 00:00:00 -0.152767922
2: 1 2022-01-01 00:00:01 -1.159560131
3: 1 2022-01-01 00:00:02 0.794597776
4: 1 2022-01-01 00:00:03 -1.065863531
5: 1 2022-01-01 00:00:04 -0.513292070
6: 0 2022-01-01 00:00:05 -1.065909875
7: 0 2022-01-01 00:00:06 -0.643175787
8: 0 2022-01-01 00:00:07 0.817414048
9: 0 2022-01-01 00:00:08 -0.629111341
10: 0 2022-01-01 00:00:09 1.491066477
11: 0 2022-01-01 00:00:10 0.233849804
12: 0 2022-01-01 00:00:11 -0.007799405
13: 0 2022-01-01 00:00:12 -1.314916805
14: 0 2022-01-01 00:00:13 0.335385778
15: 0 2022-01-01 00:00:14 -0.093167347
16: 0 2022-01-01 00:00:15 0.646596214
17: 0 2022-01-01 00:00:16 -0.969331732
18: 0 2022-01-01 00:00:17 1.681191187
19: 0 2022-01-01 00:00:18 0.357307413
20: 1 2022-01-01 00:00:19 -0.940199141
21: 1 2022-01-01 00:00:20 0.059556026
22: 1 2022-01-01 00:00:21 0.098646529
23: 1 2022-01-01 00:00:22 0.324442236
24: 1 2022-01-01 00:00:23 -1.414633187
25: 1 2022-01-01 00:00:24 0.850281851
Y time Z