How do I add a column to a data frame in R that uses information from multiple columns and rows?

Time:12-08

I am trying to add a column called travel_time to my data frame bus78_uniM. It should contain the time difference between two adjacent rows, plus the difference in the deviation. So far I have created a function, Trav_time, which takes four inputs: the dataset, the row number, and the two column numbers being used:

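Trav_time = function(df,i,j,k){
  
  # Only compute a travel time when column 1 differs by 2 or by 5 between rows
  if((df[i,1]-df[i-1,1])==2){
    # (time + deviation) of row i minus (time + deviation) of row i-1
    trav_time = as.numeric(df[i,j] + df[i,k]-df[i-1,j]-df[i-1,k])
    return(trav_time)
  }
  else if((df[i,1]-df[i-1,1])==5){
    trav_time = as.numeric(df[i,j] + df[i,k]-df[i-1,j]-df[i-1,k])
    return(trav_time)
  }
  else{
    return(NA)
  }
}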

The function returns the correct values (for my purposes) for each row value of the data frame, but I can't find a way to nicely join all the individual values to the rest of the data frame.

So far, I have tried adding an empty column and then filling it using a for loop:

bus78_uniM['travel_time'] <- NA
for(i in 2:nrow(bus78_uniM)){
bus78_uniM[i,11]<- bus78_uniM[Trav_time(bus78_uniM,i,5,6),11]
}

But this returns the error message:


Error: Assigned data `bus78_uniM[Trav_time(bus78_uniM, i, 5, 6), 11]` must be compatible with row subscript `i`.
x 1 row must be assigned.
x Assigned data has 46955 rows.
i Row updates require a list value. Do you need `list()` or `as.list()`?

I was wondering if there is a better way to do this or, alternatively, a way to modify the function so that I can simply use the base R function sapply() instead.

Thanks in advance for any tips!

Edit: A snapshot of the dataset:

[image: snapshot of the dataset]

CodePudding user response:

If travel_time should contain the difference between consecutive rows, then you can use the dplyr::lag function; an example follows. It's not clear to me what the issue with the "missing rows inbetween the rows" is. Maybe you can provide a sample of your actual data (using dput) and describe the issue in detail.

library(tidyverse)
library(lubridate)

set.seed(124)
lags <- sample.int(100, 10)
departure <-  ymd_hms("2022-06-04 12:00:00")

tibble(sec = lags)  %>% 
  mutate(timestamp = departure + cumsum(sec)) %>% 
  mutate(lagged_timestamp = lag(timestamp, default = departure)) %>% 
  mutate(interval = timestamp - lagged_timestamp)
# A tibble: 10 × 4
     sec timestamp           lagged_timestamp    interval
   <int> <dttm>              <dttm>              <drtn>  
 1    65 2022-06-04 12:01:05 2022-06-04 12:00:00  65 secs
 2   167 2022-06-04 12:03:52 2022-06-04 12:01:05 167 secs
 3   155 2022-06-04 12:06:27 2022-06-04 12:03:52 155 secs
 4     5 2022-06-04 12:06:32 2022-06-04 12:06:27   5 secs
 5   134 2022-06-04 12:08:46 2022-06-04 12:06:32 134 secs
 6   173 2022-06-04 12:11:39 2022-06-04 12:08:46 173 secs
 7    74 2022-06-04 12:12:53 2022-06-04 12:11:39  74 secs
 8   161 2022-06-04 12:15:34 2022-06-04 12:12:53 161 secs
 9   143 2022-06-04 12:17:57 2022-06-04 12:15:34 143 secs
10    91 2022-06-04 12:19:28 2022-06-04 12:17:57  91 secs
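
The same lag() idea can probably be applied to the question's rule directly. The following is only a sketch under assumptions: the toy data and the column names stop_id, time_sec and deviation are hypothetical stand-ins for columns 1, 5 and 6 of bus78_uniM, whose real layout is not shown in the thread.

```r
library(dplyr)

# Hypothetical toy data: names and values are assumptions for illustration only.
toy <- tibble(
  stop_id   = c(1, 3, 8, 9, 11),
  time_sec  = c(0, 60, 200, 260, 330),
  deviation = c(5, 10, -5, 0, 20)
)

toy <- toy %>%
  mutate(
    adjusted    = time_sec + deviation,        # time plus deviation for each row
    gap         = stop_id - lag(stop_id),      # difference in column 1 to the previous row
    # Keep the difference only where the gap is 2 or 5, as in Trav_time; otherwise NA.
    travel_time = if_else(gap %in% c(2, 5), adjusted - lag(adjusted), NA_real_)
  )

toy
```

Because lag() shifts the whole column at once, no explicit loop or row index is needed, and the first row gets NA automatically since it has no predecessor.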

CodePudding user response:

I ended up finding a solution! Thank you to everyone who responded to the post; it really helped me put it all together. I created a vector using the for loop and then added the vector to the dataset. I also modified the original function so that all the outputs were of type 'numeric'.

my_vec = c()
for(i in 1:nrow(bus78_uniM)){
  val <- Trav_time(bus78_uniM,i,5,6)
  my_vec <- c(my_vec,val)
}
bus78_uniM['travel_time'] <- my_vec
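
For readers who want the sapply()-style version asked about in the question, here is a hedged sketch using vapply(), a stricter base R relative of sapply(). The original loop most likely failed because it used the value returned by Trav_time as a row index on the right-hand side; when that value is NA, the subset returns every row. Assigning the function's result directly avoids this. The toy data and the simplified trav_time helper are assumptions for illustration, not the modified function from this answer (which is not shown).

```r
# Hypothetical toy data; column positions 1, 2, 3 stand in for the real 1, 5, 6.
toy <- data.frame(
  stop_id   = c(1, 3, 8, 9, 11),
  time_sec  = c(0, 60, 200, 260, 330),
  deviation = c(5, 10, -5, 0, 20)
)

# Simplified stand-in for Trav_time that always returns a single numeric value.
trav_time <- function(df, i, j, k) {
  if (i < 2) return(NA_real_)                 # the first row has no previous row
  gap <- df[[1]][i] - df[[1]][i - 1]
  if (gap %in% c(2, 5)) {
    (df[[j]][i] + df[[k]][i]) - (df[[j]][i - 1] + df[[k]][i - 1])
  } else {
    NA_real_
  }
}

# vapply() builds the whole result vector and checks each value is one numeric.
toy$travel_time <- vapply(
  seq_len(nrow(toy)),
  function(i) trav_time(toy, i, 2, 3),
  numeric(1)
)

toy
```

Compared with appending to my_vec on every iteration, this avoids repeatedly copying the vector, which may matter for a data frame with tens of thousands of rows.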

I apologize for the formatting; I'm still learning how to use Stack Overflow.
