I am trying to split a dataset of a firm's employees into categories based on the time each person has spent in the firm ("time_passed"): 0 to 5 years, 6 to 10 years, 11 to 15 years, and so on. I imagine this is possible without a for-loop, but I would like to be able to do it both with a for-loop and with split (or subset, or any other R function).
Here is the structure of my dataset:
structure(list(sex = c("F", "H", "F", "F", "H", "F"), age = c("24",
"33", "53", "32", "38", "21"), time_passed = c("0", "3", "4",
"0", "2", "0"), level = c("N7 ", "N7 ", "N9 ", "N7 ", "N8 ",
" "), wage = c("2605", "4931", "11123", "3750", "6180", "858.31"
)), row.names = c(NA, 6L), class = "data.frame")
And here is the for-loop I tried, without success:
list_tranches <- c()
for (i in seq(from = 5, to = 40, by=5)) {
for (j in 1:nrow(data_2021)){
if(data_2021[j,4] %in% seq(i-5 1:i))
tranche_i <- data_2021[j,]
list_tranches <- c(list_tranches, tranche_i)
}
}
Ultimately, I want to add a variable "tranche" to my dataset, indicating for each individual which time_passed category they fall into (0 to 5 years, 6 to 10 years, etc.). How should I proceed?
CodePudding user response:
Are you looking for findInterval or cut, followed by split?
data_2021 <-
structure(list(
sex = c("F", "H", "F", "F", "H", "F"),
age = c("24", "33", "53", "32", "38", "21"),
time_passed = c("0", "3", "4", "0", "2", "0"),
level = c("N7 ", "N7 ", "N9 ", "N7 ", "N8 ", " "),
wage = c("2605", "4931", "11123", "3750", "6180", "858.31")),
row.names = c(NA, 6L),
class = "data.frame")
data_2021$time_passed <- as.integer(data_2021$time_passed)
breaks <- seq(0, 49, by = 5)
ff <- findInterval(data_2021$time_passed, breaks)
split(data_2021, ff)
#> $`1`
#>   sex age time_passed level   wage
#> 1   F  24           0   N7    2605
#> 2   H  33           3   N7    4931
#> 3   F  53           4   N9   11123
#> 4   F  32           0   N7    3750
#> 5   H  38           2   N8    6180
#> 6   F  21           0       858.31
cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
split(data_2021, cc)
#> $`[0,5]`
#>   sex age time_passed level   wage
#> 1   F  24           0   N7    2605
#> 2   H  33           3   N7    4931
#> 3   F  53           4   N9   11123
#> 4   F  32           0   N7    3750
#> 5   H  38           2   N8    6180
#> 6   F  21           0       858.31
Created on 2022-08-04 by the reprex package (v2.0.1)
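One thing to watch: findInterval and cut treat the break points differently by default. findInterval uses left-closed intervals [a, b), while cut uses right-closed intervals (a, b], with include.lowest = TRUE closing the first one at 0. A minimal sketch with made-up values (the vector x is hypothetical, not taken from the question):

```r
# Hypothetical tenure values chosen to sit on and around the break points
x <- c(0, 4, 5, 6, 10, 11)
breaks <- seq(0, 49, by = 5)

# findInterval: intervals are [0,5), [5,10), ... so 5 falls in group 2
fi <- findInterval(x, breaks)

# cut: intervals are [0,5], (5,10], ... so 5 stays in the first group
ct <- cut(x, breaks = breaks, include.lowest = TRUE)

# Side-by-side comparison of the two groupings
data.frame(x, findInterval = fi, cut = ct)
```

For whole years, the cut version is the one that matches the 0 to 5, 6 to 10 bands described in the question.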
To add a new column tranche, use cut/split and the names attribute of the result.
cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
sp <- split(data_2021, cc)
res <- lapply(seq_along(sp), \(i) {
  sp[[i]]$tranche <- names(sp)[i]
  sp[[i]]
})
rm(sp)
res <- do.call(rbind, res)
res
#>   sex age time_passed level   wage tranche
#> 1   F  24           0   N7    2605   [0,5]
#> 2   H  33           3   N7    4931   [0,5]
#> 3   F  53           4   N9   11123   [0,5]
#> 4   F  32           0   N7    3750   [0,5]
#> 5   H  38           2   N8    6180   [0,5]
#> 6   F  21           0       858.31   [0,5]
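If the split itself is not needed for anything else, the factor returned by cut can also be assigned directly as the new column, which keeps the original row order. A minimal sketch on a trimmed copy of the data (only two of the columns); the labels argument and the label text are my own choice for illustration, not something the question asks for:

```r
# Trimmed copy of the question's data: only the columns needed here
data_2021 <- data.frame(
  sex = c("F", "H", "F", "F", "H", "F"),
  time_passed = c("0", "3", "4", "0", "2", "0")
)
data_2021$time_passed <- as.integer(data_2021$time_passed)

breaks <- seq(0, 45, by = 5)
# Illustrative labels: "0-5", "6-10", ..., "41-45"
labs <- paste(c(0, breaks[-c(1, length(breaks))] + 1), breaks[-1], sep = "-")

# cut returns a factor with one level per interval; store it as a column
data_2021$tranche <- cut(data_2021$time_passed, breaks = breaks,
                         labels = labs, include.lowest = TRUE)
data_2021
```

Because tranche is a factor, split(data_2021, data_2021$tranche) still works afterwards if the per-tranche data frames are wanted.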
CodePudding user response:
It is quicker to do this without a loop. Once time_passed is numeric (see the conversion below), the following one-liner does what you are trying to achieve:
split(data_2021, data_2021$time_passed %/% 5)
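To see what the one-liner groups by: %/% is integer division, so dividing by 5 maps 0 to 4 onto 0, 5 to 9 onto 1, and so on. Note that these bands (0-4, 5-9, ...) differ slightly from the 0-5, 6-10 bands described in the question. A small sketch with made-up values (x is hypothetical):

```r
# Hypothetical tenure values, already numeric
x <- c(0, 4, 5, 9, 10, 14)

# Integer division by 5 gives the index of the 5-year band
band <- x %/% 5
band
#> [1] 0 0 1 1 2 2

# Optional: a readable label for each band, e.g. "0-4", "5-9"
band_label <- paste(band * 5, band * 5 + 4, sep = "-")
band_label
```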
However, if you want to do it with a for loop, there are a few problems with your code. Firstly, if you are trying to compare numbers, you need to make sure that your column is numeric. Your dput shows that the time_passed column is a character column, so you need to start with:
data_2021$time_passed <- as.numeric(data_2021$time_passed)
Secondly, you should define list_tranches as a list, rather than a vector.
list_tranches <- list()
There are a few problems within your loop. Firstly, you don't need a nested loop at all, since indexing is vectorized in R. Secondly, time_passed is the third column in your data frame but you are looking for values in the fourth column. Thirdly, your seq syntax is wrong. It will always generate a sequence starting from 1.
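Operator precedence is a common source of trouble when building such sequences: the colon operator binds more tightly than binary minus, so an expression such as i - 5:1 means i - (5:1). A quick illustration with i = 10:

```r
i <- 10

# `:` is evaluated before `-`, so this is 10 - c(5, 4, 3, 2, 1)
lower_band <- i - 5:1
lower_band
#> [1] 5 6 7 8 9

# Parentheses are needed to get a sequence from i - 5 up to i
full_range <- (i - 5):i
full_range
#> [1]  5  6  7  8  9 10
```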
Putting these together, we have:
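for (i in seq(from = 5, to = 40, by = 5)) {
  # Rows whose time_passed is one of i-5, ..., i-1 (0 to 4, 5 to 9, ...)
  j <- which(data_2021$time_passed %in% (i - 5:1))
  # Store the tranche only if it contains at least one row
  if (length(j) > 0) list_tranches[[i / 5]] <- data_2021[j, ]
}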
list_tranches
#> [[1]]
#>   sex age time_passed level   wage
#> 1   F  24           0   N7    2605
#> 2   H  33           3   N7    4931
#> 3   F  53           4   N9   11123
#> 4   F  32           0   N7    3750
#> 5   H  38           2   N8    6180
#> 6   F  21           0       858.31
Of course, the example isn't great here, since all the values are in the same tranche.
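To see the loop produce more than one tranche, here is the same loop run on a small made-up data frame (demo and its values are hypothetical, not from the question):

```r
# Hypothetical data spanning several tranches
demo <- data.frame(id = 1:6, time_passed = c(0, 3, 7, 12, 8, 14))

list_tranches <- list()
for (i in seq(from = 5, to = 40, by = 5)) {
  # Rows whose time_passed is one of i-5, ..., i-1
  j <- which(demo$time_passed %in% (i - 5:1))
  if (length(j) > 0) list_tranches[[i / 5]] <- demo[j, ]
}

# Number of rows in each tranche
sapply(list_tranches, nrow)
#> [1] 2 2 2

# time_passed values that ended up in the second tranche (5 to 9 years)
list_tranches[[2]]$time_passed
#> [1] 7 8
```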