Home > Enterprise >  R - if else taking too long for 16 million rows
R - if else taking too long for 16 million rows

Time:12-29

I have a dataframe of 16million rows, and I am looking to add a column based on existing column, Month. If the month is 3 or 4 or 5, the column Season will be spring, etc.

for (i in 1:nrow(df)) {
    if (df$Month[[i]] %in% c(3,4,5)) {
        df$Season[[i]] <- "Spring"
    } else if (df$Month[[i]] %in% c(6,7,8)) {
        df$Season[[i]] <- "Summer"
    } else if (df$Month[[i]] %in% c(9,10,11)) {
        df$Season[[i]] <- "Autumn"
    } else if (df$Month[[i]] %in% c(12,1,2)) {
        df$Season[[i]] <- "Winter"
    }
}

However, it is taking way too long for it to complete. What can I do?

CodePudding user response:

One of the easier and faster ways is to create a data frame of the months and seasons and then join it to your parent data frame.

Like this:

seasons<-data.frame(Month=1:12, Season=c("Winter", "Winter", rep("Spring", 3), rep("Summer", 3), rep("Autumn", 3), "Winter"))

answer <- dplyr::left_join(df, seasons)

this is assuming both data frames have matching column names "Month".

I expect about 1000x increase in performance over the for loop.

CodePudding user response:

This is more along with the lines of @Dave2e however with base R:

Season=c("Winter", "Winter", rep("Spring", 3),
        rep("Summer", 3), rep("Autumn", 3), "Winter")
df<-data.frame(month=sample(1:12,10,replace=T)) #Sample data
df$season<-Season[df$month]
df
#   month season
#1      8 Summer
#2      8 Summer
#3      5 Spring
#4      7 Summer
#5      2 Winter
#6      4 Spring
#7     12 Winter
#8      7 Summer
#9     11 Autumn
#10     1 Winter

This one, is significantly faster than the for loop method.

Using for loop (1000 rows):

   #user  system elapsed 
   #0.02    0.00    0.02

Using vectorised method (1000 rows):

   #user  system elapsed 
   #   0       0       0

Calculated using system.time.

This difference might look insignificant considering there are only 1000 rows. However, it becomes a significantly large difference as the number of rows are increased (in OP's case, 16 million)

  •  Tags:  
  • r
  • Related