Struggling with a for loop in R

I am trying to split a dataset describing the employees of a firm into categories according to "time_passed", the number of years each person has spent in the firm (0 to 5 years, 6 to 10, 11 to 15, and so on). I imagine this is possible without a for-loop, but I would like to be able to do it both with a for-loop and with split (or subset, or any other R function).

Here is the structure of my dataset:

 structure(list(sex = c("F", "H", "F", "F", "H", "F"), age = c("24", 
 "33", "53", "32", "38", "21"), time_passed = c("0", "3", "4", 
 "0", "2", "0"), level = c("N7  ", "N7  ", "N9  ", "N7  ", "N8  ", 
 "    "), wage = c("2605", "4931", "11123", "3750", "6180", "858.31"
 )), row.names = c(NA, 6L), class = "data.frame")

And here is the for-loop I tried, unsuccessfully:

 list_tranches <- c()

for (i in seq(from = 5, to = 40, by=5)) {
  for (j in 1:nrow(data_2021)){
    if(data_2021[j,4] %in% seq(i-5 1:i))
    tranche_i <- data_2021[j,]
    list_tranches <- c(list_tranches, tranche_i)
  }
}

Ultimately, I want to add a variable "tranche" to my dataset df, indicating which time_passed category each individual falls into (0 to 5 years, 6 to 10 years, etc.). How could I proceed?

CodePudding user response：

Are you looking for findInterval or cut followed by split?

data_2021 <-
  structure(list(
    sex = c("F", "H", "F", "F", "H", "F"), 
    age = c("24", "33", "53", "32", "38", "21"), 
    time_passed = c("0", "3", "4", "0", "2", "0"), 
    level = c("N7  ", "N7  ", "N9  ", "N7  ", "N8  ", "    "), 
    wage = c("2605", "4931", "11123", "3750", "6180", "858.31")), 
    row.names = c(NA, 6L), 
    class = "data.frame")

data_2021$time_passed <- as.integer(data_2021$time_passed)

breaks <- seq(0, 49, by = 5)
ff <- findInterval(data_2021$time_passed, breaks)
split(data_2021, ff)
#> $`1`
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31

cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
split(data_2021, cc)
#> $`[0,5]`
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31

Created on 2022-08-04 by the reprex package (v2.0.1)
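One thing worth checking is how the two approaches treat values that sit exactly on a break. findInterval uses left-closed intervals, so a value of 5 lands in the second group, while cut is right-closed by default, so 5 stays in the first group (which matches the "0 to 5, 6 to 10" bands in the question). A minimal sketch on made-up values (the vector x is hypothetical, not taken from the question's data); passing right = FALSE to cut makes it agree with findInterval:

```r
# Hypothetical tenures chosen to sit on and around the break points
x <- c(0, 4, 5, 6, 10, 11)
breaks <- seq(0, 45, by = 5)  # same values as seq(0, 49, by = 5)

# findInterval: left-closed intervals [0,5), [5,10), ...
fi <- findInterval(x, breaks)

# cut with defaults: right-closed intervals [0,5], (5,10], ...
cut_right <- cut(x, breaks = breaks, include.lowest = TRUE)

# cut with right = FALSE: left-closed intervals, like findInterval
cut_left <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Side-by-side comparison
data.frame(x, fi, cut_right, cut_left)
```

Which convention is right depends on whether someone with exactly 5 years should count as "0 to 5" or as the start of the next band.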

To add a new column tranche, use cut/split and the result's names attribute.

cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
sp <- split(data_2021, cc)
res <- lapply(seq_along(sp), \(i){
  sp[[i]]$tranche <- names(sp)[i]
  sp[[i]]
})
rm(sp)
res <- do.call(rbind, res)
res
#>   sex age time_passed level   wage tranche
#> 1   F  24           0  N7     2605   [0,5]
#> 2   H  33           3  N7     4931   [0,5]
#> 3   F  53           4  N9    11123   [0,5]
#> 4   F  32           0  N7     3750   [0,5]
#> 5   H  38           2  N8     6180   [0,5]
#> 6   F  21           0       858.31   [0,5]
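Since cut already returns one label per row, in the original row order, a shorter route may be to assign its result straight to a new column, without splitting and re-binding. A minimal sketch, using a small made-up data frame (df and its values are hypothetical) so that several tranches appear:

```r
# Hypothetical data: only time_passed matters for the example
df <- data.frame(
  sex = c("F", "H", "F", "H"),
  time_passed = c("0", "5", "6", "12"),
  stringsAsFactors = FALSE
)

# The column is character in the question's data, so convert it first
df$time_passed <- as.integer(df$time_passed)

breaks <- seq(0, 45, by = 5)  # same values as seq(0, 49, by = 5)

# One label per row, in row order; as.character() turns the factor into text
df$tranche <- as.character(
  cut(df$time_passed, breaks = breaks, include.lowest = TRUE)
)

df
```

Keeping the factor instead of calling as.character() is also reasonable if the tranche order matters later, for example in tables or plots.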


CodePudding user response：

It's obviously quicker to do this without a loop. The following one-liner does the same as what you are trying to achieve:

split(data_2021, as.numeric(data_2021$time_passed) %/% 5)
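Here %/% is integer division, so values 0 to 4 map to group 0, values 5 to 9 to group 1, and so on. A small sketch on made-up values (x is hypothetical), which also shows one possible way to turn the group index into a readable label; the "0-4" label format is only an illustration:

```r
# Hypothetical tenures
x <- c(0, 4, 5, 9, 10, 23)

# Integer division by the band width gives a group index per value
grp <- x %/% 5

# Build a label from the lower and upper bound of each band
label <- paste0(grp * 5, "-", grp * 5 + 4)

data.frame(x, grp, label)
```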

However, if you want to do it with a for loop, there are a few problems with your code. Firstly, if you are trying to compare numbers, you need to make sure that your column is numeric. Your dput shows that the time_passed column is a character column, so you need to start with:

data_2021$time_passed <- as.numeric(data_2021$time_passed)

Secondly, you should define list_tranches as a list, rather than a vector.

list_tranches <- list()

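There are also a few problems within your loop. Firstly, you don't need a nested loop at all, since indexing is vectorized in R. Secondly, time_passed is the third column in your data frame, but you are looking for values in the fourth column. Thirdly, your seq call is malformed: seq(i-5 1:i) is missing a comma, so it does not parse, and the range you want for each tranche runs from i - 5 to i - 1. Finally, your if has no braces, so only the first statement after it is conditional.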

Putting these together, we have:

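for (i in seq(from = 5, to = 40, by = 5)) {
  # `:` binds tighter than `-`, so i - 5:1 is i - (5:1), i.e. (i - 5):(i - 1)
  j <- which(data_2021$time_passed %in% (i - 5:1))
  # Store non-empty tranches only; i / 5 gives list positions 1, 2, 3, ...
  if (length(j) > 0) list_tranches[[i / 5]] <- data_2021[j, ]
}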
  
list_tranches
#> [[1]]
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31

Of course, the example isn't great here, since all the values are in the same tranche.
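To get the tranche column itself from a for loop, one option is to pre-allocate the column and fill it band by band. A minimal sketch on made-up data (df, its values, and the "0-4" label format are all hypothetical), using the same left-closed 5-year bands as the loop above:

```r
# Hypothetical data spanning several tranches
df <- data.frame(time_passed = c("0", "3", "7", "12", "5"),
                 stringsAsFactors = FALSE)
df$time_passed <- as.numeric(df$time_passed)

# Pre-allocate so rows outside every band stay NA
df$tranche <- NA_character_

for (i in seq(from = 5, to = 40, by = 5)) {
  # Rows whose tenure lies between i - 5 and i - 1 inclusive
  in_band <- df$time_passed >= i - 5 & df$time_passed < i
  df$tranche[in_band] <- paste0(i - 5, "-", i - 1)
}

df
```

The loop runs once per band rather than once per row, so it stays short even for a large data frame.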
