I have a table such as:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
employ.data$Salary.category <- NA
employ.data$Salary.next.year <- 0
I am looking to add more columns, where the value of each column will be a function of the salary.
I have created the following loop:
for (employee in 1:nrow(employ.data)){
  if (is.na(employ.data[employee, 2]) == FALSE){
    if (employ.data[employee, 2] <= 22000){
      employ.data[employee, 4] = "Sub 22k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.20
    } else if (employ.data[employee, 2] > 22000 && employ.data[employee, 2] <= 23000){
      employ.data[employee, 4] = "Sub 23k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.10
    } else if (employ.data[employee, 2] > 23000){
      employ.data[employee, 4] = "Sub 24k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.10
    }
  }
}
It works well; the resulting data frame is:
> employ.data
    employee salary  startdate Salary.category Salary.next.year
1   John Doe  21000 2010-11-01         Sub 22k            25200
2 Peter Gynn  23400 2008-03-25         Sub 24k            25740
3 Jolie Hope  26800 2007-03-14         Sub 24k            29480
The issue is that in the actual table I have about 5 columns to add over more than 1 million rows, and this takes about 2 hours to run. Is there a better way to build the additional columns, for example with the apply family?
CodePudding user response:
This should be much faster. When I test on your data copied out to 150,000 rows, your loop takes 431 seconds on my old laptop, but the code below takes 0.35 seconds, more than 1,200x faster. Running the same code through the high-performance collapse library makes it another 3x faster, about 4,000x as fast overall. That should bring your runtime down to a couple of seconds.
library(dplyr)
# options(collapse_mask = "manip") # optional, to use faster "collapse"
# library(collapse) # library to run code below
employ.data %>%
  mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                     salary <= 23000 ~ "Sub 23k",
                                     TRUE ~ "Sub 24k"),
         Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                               salary <= 23000 ~ 1.1, # redundant
                                               TRUE ~ 1.1))
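One practical note (not shown above): mutate() returns a new data frame rather than modifying employ.data in place, so assign the result back if you want to keep the new columns. A minimal sketch; the is.na() case is my addition so that NA salaries stay NA, mirroring the check in the original loop:

# assign the result back; for each row the first matching condition wins
employ.data <- employ.data %>%
  mutate(Salary.category = case_when(is.na(salary)   ~ NA_character_,  # keep NAs as NA, like the loop
                                     salary <= 22000 ~ "Sub 22k",
                                     salary <= 23000 ~ "Sub 23k",
                                     TRUE            ~ "Sub 24k"),
         Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                               TRUE            ~ 1.1))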
Larger data set:
employ.data %>% tidyr::uncount(50000) -> employ.data # 150,000 rows instead of 3
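To reproduce the comparison on the enlarged data, one option (my sketch, not from the answer above; timings will vary by machine) is to wrap each version in system.time():

# time the vectorized dplyr version on the 150,000-row data
system.time(
  result <- employ.data %>%
    mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                       salary <= 23000 ~ "Sub 23k",
                                       TRUE ~ "Sub 24k"),
           Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                                 TRUE ~ 1.1))
)
# wrap the original for loop in system.time({ ... }) the same way to compare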
The fundamental slowdown is that your code is not vectorized: the loop asks R to repeat its interpretation overhead and modify the data frame one cell at a time for every row, instead of doing that work once and applying the calculation to whole columns. Here's a good explanation of how that works: https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
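As an illustration of that point (my addition, not part of the answer above), the same column-at-a-time logic also works in base R without any packages, using cut() for the salary bands and ifelse() for the raise:

# vectorized base R equivalent: each column is computed in one call over the whole vector
employ.data$Salary.category <- as.character(   # cut() returns a factor, hence as.character()
  cut(employ.data$salary,
      breaks = c(-Inf, 22000, 23000, Inf),     # (-Inf,22k], (22k,23k], (23k,Inf]
      labels = c("Sub 22k", "Sub 23k", "Sub 24k"))
)
employ.data$Salary.next.year <- employ.data$salary *
  ifelse(employ.data$salary <= 22000, 1.2, 1.1)
# NA salaries propagate to NA in both columns (the original loop skipped those rows instead)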