I have a dataframe of 16million rows, and I am looking to add a column based on existing column, Month. If the month is 3 or 4 or 5, the column Season will be spring, etc.
for (i in 1:nrow(df)) {
if (df$Month[[i]] %in% c(3,4,5)) {
df$Season[[i]] <- "Spring"
} else if (df$Month[[i]] %in% c(6,7,8)) {
df$Season[[i]] <- "Summer"
} else if (df$Month[[i]] %in% c(9,10,11)) {
df$Season[[i]] <- "Autumn"
} else if (df$Month[[i]] %in% c(12,1,2)) {
df$Season[[i]] <- "Winter"
}
}
However, it is taking way too long for it to complete. What can I do?
CodePudding user response:
One of the easier and faster ways is to create a data frame of the months and seasons and then join it to your parent data frame.
Like this:
seasons<-data.frame(Month=1:12, Season=c("Winter", "Winter", rep("Spring", 3), rep("Summer", 3), rep("Autumn", 3), "Winter"))
answer <- dplyr::left_join(df, seasons)
this is assuming both data frames have matching column names "Month".
I expect about 1000x increase in performance over the for
loop.
CodePudding user response:
This is more along with the lines of @Dave2e however with base R:
Season=c("Winter", "Winter", rep("Spring", 3),
rep("Summer", 3), rep("Autumn", 3), "Winter")
df<-data.frame(month=sample(1:12,10,replace=T)) #Sample data
df$season<-Season[df$month]
df
# month season
#1 8 Summer
#2 8 Summer
#3 5 Spring
#4 7 Summer
#5 2 Winter
#6 4 Spring
#7 12 Winter
#8 7 Summer
#9 11 Autumn
#10 1 Winter
This one, is significantly faster than the for
loop method.
Using for
loop (1000 rows):
#user system elapsed
#0.02 0.00 0.02
Using vectorised method (1000 rows):
#user system elapsed
# 0 0 0
Calculated using system.time
.
This difference might look insignificant considering there are only 1000 rows. However, it becomes a significantly large difference as the number of rows are increased (in OP's case, 16 million)