I am trying to add a column called travel_time to my data frame bus78_uniM. It should contain the time difference between two adjacent rows, plus the difference in the deviation. So far I have created a function, Trav_time, which takes four inputs: the dataset, the row number, and the two column numbers being used:
Trav_time = function(df,i,j,k){
  if((df[i,1]-df[i-1,1])==2){
    trav_time = as.numeric(df[i,j]+df[i,k]-df[i-1,j]-df[i-1,k])
    return(trav_time)
  }
  else if((df[i,1]-df[i-1,1])==5){
    trav_time = as.numeric(df[i,j]+df[i,k]-df[i-1,j]-df[i-1,k])
    return(trav_time)
  }
  else{
    return(NA)
  }
}
The function returns the correct values (for my purposes) for each row value of the data frame, but I can't find a way to nicely join all the individual values to the rest of the data frame.
So far, I have tried adding an empty column and then filling it using a for loop:
bus78_uniM['travel_time'] <- NA
for(i in 2:nrow(bus78_uniM)){
  bus78_uniM[i,11]<- bus78_uniM[Trav_time(bus78_uniM,i,5,6),11]
}
But this returns the error message:
Error: Assigned data `bus78_uniM[Trav_time(bus78_uniM, i, 5, 6), 11]` must be compatible with row subscript `i`.
x 1 row must be assigned.
x Assigned data has 46955 rows.
i Row updates require a list value. Do you need `list()` or `as.list()`?
I was wondering whether there is a better way to do this or, alternatively, a way to modify the function so that I can simply use the base R function sapply() instead.
Thanks in advance for any tips!
Edit: A snapshot of the dataset:
CodePudding user response:
If travel_time should contain the difference between consecutive rows, then you can use the dplyr::lag function. Below is an example.
It's not clear to me what the nature of the issue with the "missing rows inbetween the rows" is. Maybe you can provide a sample of your actual data (using dput) and describe the issue in detail.
library(tidyverse)
library(lubridate)
set.seed(124)
lags <- sample.int(100, 10)
departure <- ymd_hms("2022-06-04 12:00:00")
tibble(sec = lags) %>%
  mutate(timestamp = departure + cumsum(sec)) %>%
  mutate(lagged_timestamp = lag(timestamp, default = departure)) %>%
  mutate(interval = timestamp - lagged_timestamp)
# A tibble: 10 × 4
     sec timestamp           lagged_timestamp    interval
   <int> <dttm>              <dttm>              <drtn>
 1    65 2022-06-04 12:01:05 2022-06-04 12:00:00  65 secs
 2   167 2022-06-04 12:03:52 2022-06-04 12:01:05 167 secs
 3   155 2022-06-04 12:06:27 2022-06-04 12:03:52 155 secs
 4     5 2022-06-04 12:06:32 2022-06-04 12:06:27   5 secs
 5   134 2022-06-04 12:08:46 2022-06-04 12:06:32 134 secs
 6   173 2022-06-04 12:11:39 2022-06-04 12:08:46 173 secs
 7    74 2022-06-04 12:12:53 2022-06-04 12:11:39  74 secs
 8   161 2022-06-04 12:15:34 2022-06-04 12:12:53 161 secs
 9   143 2022-06-04 12:17:57 2022-06-04 12:15:34 143 secs
10    91 2022-06-04 12:19:28 2022-06-04 12:17:57  91 secs
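The same lag idea can be applied to the rule in the question (keep the difference only when column 1 jumps by 2 or 5, otherwise NA). The following is only a rough sketch: the data frame toy and its column names stop_seq, time and deviation are made up for illustration, standing in for columns 1, j and k of the real data.

```r
library(dplyr)

# Hypothetical stand-in for bus78_uniM; the column names are assumptions.
toy <- tibble(
  stop_seq  = c(1, 3, 8, 9, 11),           # plays the role of column 1
  time      = c(100, 160, 400, 460, 520),  # plays the role of column j
  deviation = c(5, -10, 20, 0, 15)         # plays the role of column k
)

toy <- toy %>%
  mutate(
    # Jump in the first column relative to the previous row (NA for row 1).
    gap = stop_seq - lag(stop_seq),
    # Keep the difference only when the jump is 2 or 5, as in Trav_time.
    travel_time = if_else(
      gap %in% c(2, 5),
      (time + deviation) - (lag(time) + lag(deviation)),
      NA_real_
    )
  ) %>%
  select(-gap)

toy$travel_time
```

Because lag() works on whole columns, no explicit loop over row numbers is needed, and the first row becomes NA automatically since it has no predecessor.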
CodePudding user response:
I ended up finding a solution! Thank you to everyone who responded to the post; it really helped me put it all together. I ended up creating a vector using the for loop and then adding the vector to the dataset. I also modified the original function so that all the outputs are of type 'numeric'.
my_vec = c()
for(i in 1:nrow(bus78_uniM)){
  # 'val' instead of 'c', so the loop variable does not shadow the function c()
  val <- Trav_time(bus78_uniM,i,5,6)
  my_vec <- c(my_vec,val)
}
bus78_uniM['travel_time'] <- my_vec
I apologize for the formatting - still learning how to use Stack Overflow.
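For reference, the loop that grows my_vec one element at a time could probably also be written with vapply(), which builds the whole vector in one call and checks that every result is a single numeric value. This is a sketch under assumptions: the modified Trav_time was not posted, so the version below (including the guard for the first row) is a guess at it, and toy is a made-up data frame that uses columns 2 and 3 instead of 5 and 6.

```r
# Assumed numeric-only version of Trav_time; the guard for i < 2 is an
# addition, because row 1 has no previous row to compare with.
Trav_time <- function(df, i, j, k) {
  if (i < 2) return(NA_real_)
  gap <- df[[1]][i] - df[[1]][i - 1]
  if (gap == 2 || gap == 5) {
    as.numeric(df[[j]][i] + df[[k]][i] - df[[j]][i - 1] - df[[k]][i - 1])
  } else {
    NA_real_
  }
}

# Hypothetical stand-in for bus78_uniM.
toy <- data.frame(
  stop_seq  = c(1, 3, 8, 9, 11),
  time      = c(100, 160, 400, 460, 520),
  deviation = c(5, -10, 20, 0, 15)
)

# One call per row; numeric(1) makes vapply fail loudly if any call
# returns something other than a single numeric value.
toy$travel_time <- vapply(
  seq_len(nrow(toy)),
  function(i) Trav_time(toy, i, 2, 3),
  numeric(1)
)

toy$travel_time
```

On the real data the call would presumably be Trav_time(bus78_uniM, i, 5, 6) inside the anonymous function, matching the loop above.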