Create new column based on nearby values without using for loop in R

Time:10-05

In the data frame below I want to add a column containing the difference between each row's start time and the previous row's end time. Since apply() does not give access to row indices and for loops are generally discouraged in R, I am running out of ideas for how to build this function. Here is an example of the input and of what the result should look like:

|      start_time     |      end_time       | Waiting_Time |
| ------------------- | ------------------- | ------------ |
| 1970-01-12 07:24:00 | 1970-01-12 07:24:00 |      0       |
| 1970-01-12 07:24:00 | 1970-01-12 07:30:00 |      0       |
| 1970-01-12 07:34:00 | 1970-01-12 07:47:00 |      4       |
| 1970-01-12 07:45:00 | 1970-01-12 07:45:00 |     15       |
| 1970-01-12 07:47:00 | 1970-01-12 07:52:00 |      2       |
| 1970-01-12 07:58:00 | 1970-01-12 07:58:00 |      6       |
| 1970-01-12 07:58:00 | 1970-01-12 08:12:00 |      0       |
| 1970-01-12 08:12:00 | 1970-01-12 07:30:00 |      0       |
| 1970-01-12 07:24:00 | 1970-01-12 08:20:00 |     72       |
| 1970-01-12 08:26:00 | 1970-01-12 08:26:00 |      6       |

If the start time is earlier than the end time of the previous row, the function should instead use the end time from two rows before (see rows 4 and 9 in the example).

structure(list(Case_id = c(501L, 501L, 501L, 501L, 501L, 501L, 
501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L, 501L), start_time = structure(c(977040, 
977040, 978300, 977640, 978420, 979080, 979080, 979920, 980760, 
980760, 981360, 982260, 982260, 985200, 985980), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), end_time = structure(c(977040, 977400, 
978300, 978420, 978720, 979080, 979920, 980400, 980760, 981360, 
981720, 982260, 985200, 985680, 985980), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Resource_id = c("System", "Tester5", 
"System", "SolverC2", "Tester3", "System", "SolverC1", "Tester2", 
"System", "SolverC1", "Tester5", "System", "SolverC3", "Tester1", 
"System"), Activity_id = c("Register", "Analyze Defect", "Inform User", 
"Repair (Complex)", "Test Repair", "Restart Repair", "Repair (Complex)", 
"Test Repair", "Restart Repair", "Repair (Complex)", "Test Repair", 
"Restart Repair", "Repair (Complex)", "Test Repair", "Archive Repair"
), Log = c("ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", 
"ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG", "ORIG"
), ExecTime = structure(c(0, 6, 0, 13, 5, 0, 14, 8, 0, 10, 6, 
0, 49, 8, 0), class = "difftime", units = "secs")), row.names = c(4121L, 
4122L, 4123L, 4124L, 4125L, 4126L, 4127L, 4129L, 4130L, 4132L, 
4133L, 4134L, 4135L, 4136L, 4137L), class = "data.frame")

CodePudding user response:

You can access the row before (or after) with {dplyr}'s lag() (or lead()) function.

For example:

library(dplyr)

df %>% 
  mutate(delta = start_time - lag(end_time)) %>% 
  select(start_time, end_time, delta) # only to keep the printed output short; drop this line in your own code

This yields:

              start_time            end_time     delta
4121 1970-01-12 07:24:00 1970-01-12 07:24:00   NA secs
4122 1970-01-12 07:24:00 1970-01-12 07:30:00    0 secs
4123 1970-01-12 07:45:00 1970-01-12 07:45:00  900 secs
4124 1970-01-12 07:34:00 1970-01-12 07:47:00 -660 secs
4125 1970-01-12 07:47:00 1970-01-12 07:52:00    0 secs
4126 1970-01-12 07:58:00 1970-01-12 07:58:00  360 secs
4127 1970-01-12 07:58:00 1970-01-12 08:12:00    0 secs
4129 1970-01-12 08:12:00 1970-01-12 08:20:00    0 secs
4130 1970-01-12 08:26:00 1970-01-12 08:26:00  360 secs
4132 1970-01-12 08:26:00 1970-01-12 08:36:00    0 secs
4133 1970-01-12 08:36:00 1970-01-12 08:42:00    0 secs
4134 1970-01-12 08:51:00 1970-01-12 08:51:00  540 secs
4135 1970-01-12 08:51:00 1970-01-12 09:40:00    0 secs
4136 1970-01-12 09:40:00 1970-01-12 09:48:00    0 secs
4137 1970-01-12 09:53:00 1970-01-12 09:53:00  300 secs

The first row has no previous entry, so its delta is NA. You may want to handle this case separately, for example with a conditional or by setting the value to zero.

For more control over the time deltas, read up on difftime(..., units = ...). There you can set units to 'mins' if minutes suit you better than seconds.
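As a rough sketch of the two suggestions above (explicit units and a zero for the first row), the following combines difftime() with dplyr's lag() and coalesce(). The small data frame and the column name delta_mins are made up for illustration; the timestamps are the first four rows of the question's data.

```r
library(dplyr)

# Illustrative data: first four rows of the question's data frame
df <- data.frame(
  start_time = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:24:00",
                            "1970-01-12 07:45:00", "1970-01-12 07:34:00"),
                          tz = "UTC"),
  end_time   = as.POSIXct(c("1970-01-12 07:24:00", "1970-01-12 07:30:00",
                            "1970-01-12 07:45:00", "1970-01-12 07:47:00"),
                          tz = "UTC")
)

result <- df %>%
  mutate(
    # difference to the previous row's end_time, expressed in minutes
    delta_mins = as.numeric(difftime(start_time, lag(end_time), units = "mins")),
    # the first row has no predecessor (NA); assume a waiting time of 0 there
    delta_mins = coalesce(delta_mins, 0)
  )

result$delta_mins
```

Converting to a plain numeric column is a design choice: it drops the difftime class, which makes replacing the NA straightforward, at the cost of no longer carrying the unit with the values.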

CodePudding user response:

Base R option using difftime

# Prepend 0 for the first row, which has no previous end_time
df <- transform(df, Waiting_Time = c(0, difftime(start_time[-1],
                    end_time[-nrow(df)], units = "mins")))

df$Waiting_Time
#[1]   0   0  15 -11   0   6   0   0   6   0   0   9   0   0   5
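
Neither snippet above handles the case from the question where a negative difference should fall back to the end time two rows earlier. A minimal base R sketch of one possible reading of that rule follows. The function name waiting_time is hypothetical, and it assumes that a single fallback step is enough: if the difference to the row two positions back is also negative, it is left negative.

```r
# Hypothetical helper: waiting time in minutes without an explicit loop.
# Assumption: when start_time precedes the previous row's end_time,
# use the end_time from two rows back instead.
waiting_time <- function(start_time, end_time) {
  n     <- length(start_time)
  start <- as.numeric(start_time)  # seconds since the epoch
  end   <- as.numeric(end_time)
  idx   <- seq_len(n)

  prev1 <- end[ifelse(idx > 1, idx - 1, NA)]  # end_time one row back
  prev2 <- end[ifelse(idx > 2, idx - 2, NA)]  # end_time two rows back

  wait <- start - prev1
  use_prev2 <- !is.na(wait) & wait < 0 & !is.na(prev2)
  wait[use_prev2] <- start[use_prev2] - prev2[use_prev2]

  wait[is.na(wait)] <- 0  # first row has no predecessor
  wait / 60               # seconds to minutes
}

# First five rows of the question's data
start_time <- as.POSIXct(c(977040, 977040, 978300, 977640, 978420),
                         origin = "1970-01-01", tz = "UTC")
end_time   <- as.POSIXct(c(977040, 977400, 978300, 978420, 978720),
                         origin = "1970-01-01", tz = "UTC")

wt <- waiting_time(start_time, end_time)
wt
```

With these five rows, the fourth value is computed against the second row's end time (07:30) rather than the third's (07:45), so the -11 from the output above becomes 4.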