Is there a way to make a for-loop faster?

I am working on some code, but one step is extremely slow. I need to compare two columns row by row and, wherever their values are equal, mark a 1 in a third column, as in the code below:

#FLAG_REPETIDOS
df1$FLAG_REPETIDOS <- ""

for (j in 1:nrow(df1)) {
  # Mark 1 where DATO equals DATO_ANT on this row
  df1$FLAG_REPETIDOS[[j]] <- ifelse(df1$DATO[[j]] == df1$DATO_ANT[[j]], 1, df1$FLAG_REPETIDOS[[j]])
  # A comparison involving NA yields NA; reset those rows to ""
  df1$FLAG_REPETIDOS[[j]] <- ifelse(is.na(df1$FLAG_REPETIDOS[[j]]), "", df1$FLAG_REPETIDOS[[j]])

  # Progress report every 100 rows
  if (j %% 100 == 0) {
    print(paste(j, "/", nrow(df1)))
  }
}

print(paste("Check 11:", Sys.time(), sep=" "))

Some more information: I am using a data.table, not a data frame. My computer is not the best, with only 8 GB of RAM, and the data has roughly 1M rows. By my estimate, this step alone would take around 72 hours, which is unreasonable.

Is my code doing something that could be done more simply and faster? Is there any way to optimize it? I am new to R, so I don't know much about optimization.

Thanks in advance

I already switched from a data frame to a data.table. I searched Google for optimization tips, and that was one of the things I could try.

CodePudding user response:

The way to make R code fast is to vectorize it: operate on whole columns at once instead of looping over rows one at a time.

Assuming df is a dataframe, you could probably replace all your included code with something like:

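library(dplyr)

# Assign the result back; mutate() does not modify df in place
df <- df %>%
  mutate(
    FLAG_REPETIDOS = case_when(
      is.na(DATO) | is.na(DATO_ANT) ~ "",
      DATO == DATO_ANT ~ "1",  # every branch must return the same type (character)
      TRUE ~ ""
    )
  )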

However, I'm not able to check since you did not include any data with your question.
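Since the question includes no sample data, one way to check a vectorized replacement is to run it next to the original loop on a small made-up table. The sketch below does this in base R, with ifelse() standing in for case_when() so that no package is needed. The values in df1 are hypothetical, and loop_flag, same and vec_flag are names introduced only for this illustration.

```r
# Hypothetical toy data; the real df1 is not shown in the question
df1 <- data.frame(
  DATO     = c(10, 20, NA, 40, 50),
  DATO_ANT = c(10, 25, 30, NA, 50)
)

# Row-by-row version, following the logic of the loop in the question
loop_flag <- rep("", nrow(df1))
for (j in seq_len(nrow(df1))) {
  loop_flag[j] <- ifelse(df1$DATO[j] == df1$DATO_ANT[j], 1, loop_flag[j])
  loop_flag[j] <- ifelse(is.na(loop_flag[j]), "", loop_flag[j])
}

# Vectorized version: one comparison over the whole columns
same <- df1$DATO == df1$DATO_ANT               # NA wherever either value is NA
vec_flag <- ifelse(!is.na(same) & same, "1", "")

# TRUE if both approaches produce the same flags
print(identical(loop_flag, vec_flag))
```

If the two results match on data that includes NA in both columns, the vectorized form is a reasonable drop-in for the loop on the full table.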

CodePudding user response:

Your loop is equivalent to this much simpler and faster code.

df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[which(df1$DATO == df1$DATO_ANT)] <- "1"

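Note that which() removes the danger of NAs in the index on the second line: it returns only the positions where the comparison is TRUE, so rows where DATO or DATO_ANT is NA are dropped from the index and keep the default "".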
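To see what which() does with missing values, here is a minimal sketch on made-up data. The values are hypothetical, and same and idx are names introduced only for this illustration.

```r
# Hypothetical toy data with NA in both columns
df1 <- data.frame(
  DATO     = c(1, 2, NA, 4, 5),
  DATO_ANT = c(1, 3, 3, NA, 5)
)

# Element-wise comparison: NA wherever either side is NA
same <- df1$DATO == df1$DATO_ANT
print(same)

# which() keeps only the positions that are TRUE and drops FALSE and NA
idx <- which(same)
print(idx)

# Same two lines as in the answer, using the precomputed index
df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[idx] <- "1"
print(df1)
```

Only rows 1 and 5 are flagged here; the rows containing NA are left as "", which matches what the second ifelse() in the original loop was doing.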