Add values from second dataframe if datetime is between start and end datetimes-CodePudding

I made a for loop in R to fetch information from dataframe2 to add to dataframe1. The loop takes a date and time from dataframe1 and looks up what "step number" corresponds to that date and time by checking between what start and end time (start and end are two different columns and have data each row) from dataframe2 it is. Then it adds this step number to dataframe1 for that particular row. However, this loop takes hours to run.

I have read that the cause of this long duration is that R has to build up the whole dataframe with each loop iteration (not sure if this is correct, but this is what I understood). I have tried different methods of doing the same thing but I could not get them to work.

This is an example of what I am doing with very small dataframes (actual data has ~60.000 rows in dataframe1 and 300 in dataframe2):

DateTime <- c("2022-09-20 15:00:00", "2022-09-20 19:00:00", "2022-09-21 15:00:00",
              "2022-09-21 19:00:00", "2022-09-22 15:00:00")
Value <- c(1,2,3,4,5)
dataframe1 <- data.frame(DateTime, Value)

Start <- c("2022-09-20 01:00:00", "2022-09-20 17:00:00", "2022-09-21 13:00:00",
           "2022-09-21 18:00:00", "2022-09-22 13:00:00")
End <- c("2022-09-20 16:00:00", "2022-09-20 23:59:59", "2022-09-21 17:00:00",
         "2022-09-21 23:00:00", "2022-09-22 19:00:00")
Step <- c(1,2,3,4,5)
dataframe2 <- data.frame(Start, End, Step)

dataframe1$Step <- 0
for (i in 1:nrow(dataframe1)) { 
  for (j in 1:nrow(dataframe2)) {
    if (dataframe1[i,1] > dataframe2[j,1] & dataframe1[i,1] < dataframe2[j,2]) {
      dataframe1[i,3] <- dataframe2[j,3]
    }
  }
}

First, I create a new column called "Step" to which the step number needs to be added. Then, I loop over each row in dataframe1 to get the date & time of that datapoint. After that, I loop through each row of dataframe2. In dataframe2, the first column has the start time and the second column has the end time of that step.

So if the Date and Time of a datapoint in dataframe1 is between the start and end time of a row in dataframe2, then the step number in that row of dataframe2 will be added to the new "Step" column in dataframe2.

As I said, it works, but it takes a long time and I think there should be a more computationally efficient way to do this.

CodePudding user response：

Based on the information you provided, there are a number of reasons for slow performance here, and they mainly have to do with the if statement and inner for loop.

1. Use `&&` and `||` inside `if` statements

You should always use && instead of & for if statements. && will only evaluate the second argument if the first was TRUE. & will always evaluate both because it is intended for vectorized conditions, where the two arguments are necessarily evaluated before the logical operation. Same applies to ||. This will save you many unnecessary comparisons and hence computational ressources:

if (dataframe1[i,1] > dataframe2[j,1] && dataframe1[i,1] < dataframe2[j,2])

2. `break` the loop

If you know that there will always be only one match, it would make sense to break the inner for-loop once the match has been found, i.e. the if statement evaluates to TRUE.

if (dataframe1[i,1] > dataframe2[j,1] && dataframe1[i,1] < dataframe2[j,2]) {
  dataframe1[i,3] <- dataframe2[j,3]
  break
}

This won't be useful if there are possibly multiple matches, which your text suggests. However, your code seems to consider only one match, so this might save you many more unnecessary iterations.

3. Consider numeric/date types

Comparisons using character vectors are much slower than comparisons of numeric values such as integer UNIX timestamps or the built-in POSIXlt/POSIXct date types. So consider applying those transformations (as.POSIXlt/as.POSIXct) to the datetime columns in your datasets before you run your loop.

4. Use vectorization instead of `for` where possible

The real deal though might be to replace the inner for loop and if statement with vectorized indexing:

aciddf$Step <- 0
for (i in 1:nrow(aciddf)) {
  indices <- which(aciddf[i,1] > VI_step_times[,1] & aciddf[i,1] < VI_step_times[,4])
  if(length(indices) > 0) {
    # use this if there is never more than one match:
    aciddf[i,4] <- VI_step_times[indices, 3]
    # use this for the sum of all matches (which your text suggests):
    aciddf[i,4] <- sum(VI_step_times[indices, 3])
    # use this for the the first match:
    aciddf[i,4] <- VI_step_times[indices[1], 3]
    # use this for the the last match (which your code suggests):
    aciddf[i,4] <- VI_step_times[indices[length(indices)], 3]
  }
}

This will, for each row in aciddf, check which row(s) in VI_step_times match and then use that index vector to assign the new value(s). This would be a nice way of making use of the vectorization functionality of R. Note that this may or may not be faster than the nested for loops, depending on how effortful it is for your machine to make the necessary evaluations.

CodePudding user response：

There are a few things that would help. First, vectorize the operations. Second, don't work with dataframes, work directly with the columns. Doing both of these would change your loop to look like this:

result <- rep(NA, nrow(dataframe1))
for (j in 1:nrow(dataframe2)) {
  result[ Datetime > Start[j] & Datetime < End[j] ] <- Step[j]
}
dataframe1$Value <- result

This is probably just an oversight in the example, but you should also make sure that Datetime, Start and End are actually date-time values, not strings.

CodePudding user response：

This solution assumes you are trying to check row 1 of the first data frame against row 1 of the second and so on.

If that's not accurate, then disregard.

    library(dplyr)
    library(purrr)
    
    dataframe1$step <- dataframe2$Step[pmap_lgl(
                            .l = list(dataframe1$DateTime,
                                      dataframe2$Start,
                                      dataframe2$End),
                            .f = \(x,y,z) {between(as.POSIXct(x),
                                                   as.POSIXct(y),
                                                   as.POSIXct(z)
                                                  )
                                          }
                                        )
                                      ]

1. Use && and || inside if statements

2. break the loop

3. Consider numeric/date types

4. Use vectorization instead of for where possible

1. Use `&&` and `||` inside `if` statements

2. `break` the loop

4. Use vectorization instead of `for` where possible