I made a for loop in R to fetch information from dataframe2 to add to dataframe1. The loop takes a date and time from dataframe1 and looks up what "step number" corresponds to that date and time by checking between what start and end time (start and end are two different columns and have data each row) from dataframe2 it is. Then it adds this step number to dataframe1 for that particular row. However, this loop takes hours to run.
I have read that the cause of this long duration is that R has to build up the whole dataframe with each loop iteration (not sure if this is correct, but this is what I understood). I have tried different methods of doing the same thing but I could not get them to work.
This is an example of what I am doing with very small dataframes (actual data has ~60.000 rows in dataframe1 and 300 in dataframe2):
DateTime <- c("2022-09-20 15:00:00", "2022-09-20 19:00:00", "2022-09-21 15:00:00",
"2022-09-21 19:00:00", "2022-09-22 15:00:00")
Value <- c(1,2,3,4,5)
dataframe1 <- data.frame(DateTime, Value)
Start <- c("2022-09-20 01:00:00", "2022-09-20 17:00:00", "2022-09-21 13:00:00",
"2022-09-21 18:00:00", "2022-09-22 13:00:00")
End <- c("2022-09-20 16:00:00", "2022-09-20 23:59:59", "2022-09-21 17:00:00",
"2022-09-21 23:00:00", "2022-09-22 19:00:00")
Step <- c(1,2,3,4,5)
dataframe2 <- data.frame(Start, End, Step)
dataframe1$Step <- 0
for (i in 1:nrow(dataframe1)) {
for (j in 1:nrow(dataframe2)) {
if (dataframe1[i,1] > dataframe2[j,1] & dataframe1[i,1] < dataframe2[j,2]) {
dataframe1[i,3] <- dataframe2[j,3]
}
}
}
First, I create a new column called "Step" to which the step number needs to be added. Then, I loop over each row in dataframe1 to get the date & time of that datapoint. After that, I loop through each row of dataframe2. In dataframe2, the first column has the start time and the second column has the end time of that step.
So if the Date and Time of a datapoint in dataframe1 is between the start and end time of a row in dataframe2, then the step number in that row of dataframe2 will be added to the new "Step" column in dataframe2.
As I said, it works, but it takes a long time and I think there should be a more computationally efficient way to do this.
CodePudding user response:
Based on the information you provided, there are a number of reasons for slow performance here, and they mainly have to do with the if
statement and inner for
loop.
1. Use &&
and ||
inside if
statements
You should always use &&
instead of &
for if
statements. &&
will only evaluate the second argument if the first was TRUE
. &
will always evaluate both because it is intended for vectorized conditions, where the two arguments are necessarily evaluated before the logical operation. Same applies to ||
. This will save you many unnecessary comparisons and hence computational ressources:
if (dataframe1[i,1] > dataframe2[j,1] && dataframe1[i,1] < dataframe2[j,2])
2. break
the loop
If you know that there will always be only one match, it would make sense to break
the inner for
-loop once the match has been found, i.e. the if
statement evaluates to TRUE
.
if (dataframe1[i,1] > dataframe2[j,1] && dataframe1[i,1] < dataframe2[j,2]) {
dataframe1[i,3] <- dataframe2[j,3]
break
}
This won't be useful if there are possibly multiple matches, which your text suggests. However, your code seems to consider only one match, so this might save you many more unnecessary iterations.
3. Consider numeric/date types
Comparisons using character vectors are much slower than comparisons of numeric values such as integer UNIX timestamps or the built-in POSIXlt
/POSIXct
date types. So consider applying those transformations (as.POSIXlt
/as.POSIXct
) to the datetime columns in your datasets before you run your loop.
4. Use vectorization instead of for
where possible
The real deal though might be to replace the inner for
loop and if
statement with vectorized indexing:
aciddf$Step <- 0
for (i in 1:nrow(aciddf)) {
indices <- which(aciddf[i,1] > VI_step_times[,1] & aciddf[i,1] < VI_step_times[,4])
if(length(indices) > 0) {
# use this if there is never more than one match:
aciddf[i,4] <- VI_step_times[indices, 3]
# use this for the sum of all matches (which your text suggests):
aciddf[i,4] <- sum(VI_step_times[indices, 3])
# use this for the the first match:
aciddf[i,4] <- VI_step_times[indices[1], 3]
# use this for the the last match (which your code suggests):
aciddf[i,4] <- VI_step_times[indices[length(indices)], 3]
}
}
This will, for each row in aciddf
, check which row(s) in VI_step_times
match and then use that index vector to assign the new value(s). This would be a nice way of making use of the vectorization functionality of R. Note that this may or may not be faster than the nested for
loops, depending on how effortful it is for your machine to make the necessary evaluations.
CodePudding user response:
There are a few things that would help. First, vectorize the operations. Second, don't work with dataframes, work directly with the columns. Doing both of these would change your loop to look like this:
result <- rep(NA, nrow(dataframe1))
for (j in 1:nrow(dataframe2)) {
result[ Datetime > Start[j] & Datetime < End[j] ] <- Step[j]
}
dataframe1$Value <- result
This is probably just an oversight in the example, but you should also make sure that Datetime
, Start
and End
are actually date-time values, not strings.
CodePudding user response:
This solution assumes you are trying to check row 1 of the first data frame against row 1 of the second and so on.
If that's not accurate, then disregard.
library(dplyr)
library(purrr)
dataframe1$step <- dataframe2$Step[pmap_lgl(
.l = list(dataframe1$DateTime,
dataframe2$Start,
dataframe2$End),
.f = \(x,y,z) {between(as.POSIXct(x),
as.POSIXct(y),
as.POSIXct(z)
)
}
)
]