I have a table such as:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
employ.data$Salary.category <- NA
employ.data$Salary.next.year <- 0
I am looking to add more columns, where the value of each column will be a function of the salary.
I have created the following loop:
for (employee in 1:nrow(employ.data)){
  if (is.na(employ.data[employee, 2]) == FALSE){
    if (employ.data[employee, 2] <= 22000){
      employ.data[employee, 4] = "Sub 22k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.20
    } else if (employ.data[employee, 2] > 22000 && employ.data[employee, 2] <= 23000){
      employ.data[employee, 4] = "Sub 23k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.10
    } else if (employ.data[employee, 2] > 23000){
      employ.data[employee, 4] = "Sub 24k"
      employ.data[employee, 5] = employ.data[employee, 2] * 1.10
    }
  }
}
It works well; the resulting data frame is:
> employ.data
    employee salary  startdate Salary.category Salary.next.year
1   John Doe  21000 2010-11-01         Sub 22k            25200
2 Peter Gynn  23400 2008-03-25         Sub 24k            25740
3 Jolie Hope  26800 2007-03-14         Sub 24k            29480
The issue is that in the actual table I have about 5 columns to add over more than 1 million rows, and this takes about 2 hours to run. Is there a better way to build the additional columns, for example with the apply family?
CodePudding user response:
This should be much faster. When I test on your data copied out to 150,000 rows, your loop takes 431 seconds on my old laptop, but the code below takes 0.35 seconds, more than 1,200x faster. Running the same code through the high-performance collapse library makes it another 3x faster, about 4,000x as fast overall. That should bring your runtime down to a couple of seconds.
library(dplyr)
# options(collapse_mask = "manip") # optional, to use faster "collapse"
# library(collapse) # library to run code below
employ.data %>%
  mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                     salary <= 23000 ~ "Sub 23k",
                                     TRUE ~ "Sub 24k"),
         Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                               salary <= 23000 ~ 1.1, # redundant
                                               TRUE ~ 1.1))
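One practical note (not shown above): mutate() returns a new data frame rather than modifying employ.data in place, so assign the result back if you want to keep the new columns. A minimal sketch; the is.na() case is my addition so that NA salaries stay NA, mirroring the check in the original loop:

# assign the result back; for each row the first matching condition wins
employ.data <- employ.data %>%
  mutate(Salary.category = case_when(is.na(salary)   ~ NA_character_,  # keep NAs as NA, like the loop
                                     salary <= 22000 ~ "Sub 22k",
                                     salary <= 23000 ~ "Sub 23k",
                                     TRUE            ~ "Sub 24k"),
         Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                               TRUE            ~ 1.1))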
Larger data set:
employ.data %>% tidyr::uncount(50000) -> employ.data # 150,000 rows instead of 3
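To reproduce the comparison on the enlarged data, one option (my sketch, not from the answer above; timings will vary by machine) is to wrap each version in system.time():

# time the vectorized dplyr version on the 150,000-row data
system.time(
  result <- employ.data %>%
    mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                       salary <= 23000 ~ "Sub 23k",
                                       TRUE ~ "Sub 24k"),
           Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                                 TRUE ~ 1.1))
)
# wrap the original for loop in system.time({ ... }) the same way to compare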
The fundamental slowdown is that your code is not vectorized: the loop asks R to repeat its interpretation overhead and modify the data frame one cell at a time for every row, instead of doing that work once and applying the calculation to whole columns. Here's a good explanation of how that works: https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
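As an illustration of that point (my addition, not part of the answer above), the same column-at-a-time logic also works in base R without any packages, using cut() for the salary bands and ifelse() for the raise:

# vectorized base R equivalent: each column is computed in one call over the whole vector
employ.data$Salary.category <- as.character(   # cut() returns a factor, hence as.character()
  cut(employ.data$salary,
      breaks = c(-Inf, 22000, 23000, Inf),     # (-Inf,22k], (22k,23k], (23k,Inf]
      labels = c("Sub 22k", "Sub 23k", "Sub 24k"))
)
employ.data$Salary.next.year <- employ.data$salary *
  ifelse(employ.data$salary <= 22000, 1.2, 1.1)
# NA salaries propagate to NA in both columns (the original loop skipped those rows instead)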