Home > Blockchain >  Calculating duration and key figures (avg, std, min, max) for the certain actions/variables in R?
Calculating duration and key figures (avg, std, min, max) for the certain actions/variables in R?

Time:03-14

I have a dataframe with 100000 rows and multiple variables/columns from which I would like to

  1. Calculate duration of a certain actions based on values in the column "Y". Column Y has multiple sequences of values 0 and 1 and whenever action takes place, there is values of 1. The idea would be to count a time difference from the first 1 in a sequence of ones (right after the last 0) until the final 1 in the sequence (right before next 0). For the every corresponding row of all the ones and zeros, there is always a timestamp in column "X" for the current runtime, so the time difference would basically be calculated from that with a simple substraction:
TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE 

This same calculation would be repeated multiple times for all the different sequences of ones and a new dataframe listing all of the different durations for the action would be created.

  1. In a similar manner, for the values in the column "Z", calculate average, standard deviation, min and max from the period of first 1 of a sequence of ones until the final 1 of a sequence of ones for all of the different sequences. Then combine all the data together as one dataframe and export it as a csv-file, which should include variables for "action durations", "Z avg", "Z std", "Z min", "Z max" and the "id" column from the original dataframe. How could I write script like this in R?

The pseudo style code could probably look something like this:

for all the rows in df {
   if (number 1 in column Y) {
      from first 1 until the last 1 in a sequence: calculate TIME_OF_FINAL_1_IN_SEQUENCE minus TIME_OF_FIRST_1_IN_SEQUENCE from column X
      ALSO from the range of first value of 1 to the last value of 1 in this sequence of 1:  calculate avg, std, min, and max for the variable Z 
   if (number 0) in column
      add new element/row to the list (including the variables of: "action duration", "Z avg", "Z std", "Z min", "Z max" and the "id") and move to the next 1

(Not sure if the algorithm in the pseudo code is exactly what I was describing in the text, but at least I tried my best to include some kind of "code example" here as well :-))

CodePudding user response:

I believe you have multiple possible sequences of consecutive rows of ones and zeros.. I think the approach is to generate a unique identifier for each sequence and the estimate the statistics you want over each of these identifiers. This is easily done using data.table, and data.table::rleied

library(data.table)
setDT(dt)

dt[,seqid:= rleid(Y)] %>%
  .[Y==1,.(
    seq_dur = as.numeric(max(time)-min(time)),
    meanZ = mean(Z),
    stdZ=sd(Z),
    minZ = min(Z),
    maxZ=max(Z)), by=.(seqid)]

Output:

      seqid seq_dur       meanZ      stdZ        minZ      maxZ
   1:     1       4 -0.41937718 0.7936389 -1.15956013 0.7945978
   2:     3       5 -0.17031761 0.8429274 -1.41463319 0.8502819
   3:     5      29 -0.01909116 1.1878013 -2.32238540 2.7392739
   4:     7      17 -0.14040415 0.9600719 -1.82184504 1.1401493
   5:     9       6  0.14154931 1.1930633 -1.55719089 1.6827525
  ---                                                          
2431:  4861      13 -0.17095911 0.9193558 -2.14869215 1.1571597
2432:  4863      27 -0.06239130 1.0546947 -2.46668844 2.2189060
2433:  4865      27 -0.22289381 1.0330064 -2.32818061 2.6114507
2434:  4867       2  0.42001740 0.8060201 -0.09206373 1.3491093
2435:  4869       3 -0.68025767 1.5678846 -2.96307855 0.5092229

Input:

set.seed(123)
dt = data.table(
  Y = unlist(lapply(1:10000, \(x) rep(sample(c(1,0),1), times=sample(3:8,1))))
)
dt[, time := seq(as.POSIXct("2022/1/1"),by=1, length.out=nrow(dt))]
dt[, Z:=rnorm(nrow(dt))]

head(dt)

    Y                time            Z
 1: 1 2022-01-01 00:00:00 -0.152767922
 2: 1 2022-01-01 00:00:01 -1.159560131
 3: 1 2022-01-01 00:00:02  0.794597776
 4: 1 2022-01-01 00:00:03 -1.065863531
 5: 1 2022-01-01 00:00:04 -0.513292070
 6: 0 2022-01-01 00:00:05 -1.065909875
 7: 0 2022-01-01 00:00:06 -0.643175787
 8: 0 2022-01-01 00:00:07  0.817414048
 9: 0 2022-01-01 00:00:08 -0.629111341
10: 0 2022-01-01 00:00:09  1.491066477
11: 0 2022-01-01 00:00:10  0.233849804
12: 0 2022-01-01 00:00:11 -0.007799405
13: 0 2022-01-01 00:00:12 -1.314916805
14: 0 2022-01-01 00:00:13  0.335385778
15: 0 2022-01-01 00:00:14 -0.093167347
16: 0 2022-01-01 00:00:15  0.646596214
17: 0 2022-01-01 00:00:16 -0.969331732
18: 0 2022-01-01 00:00:17  1.681191187
19: 0 2022-01-01 00:00:18  0.357307413
20: 1 2022-01-01 00:00:19 -0.940199141
21: 1 2022-01-01 00:00:20  0.059556026
22: 1 2022-01-01 00:00:21  0.098646529
23: 1 2022-01-01 00:00:22  0.324442236
24: 1 2022-01-01 00:00:23 -1.414633187
25: 1 2022-01-01 00:00:24  0.850281851
    Y                time            Z
  • Related