We need to optimize a program that processes each row's "nb_student" value: for each row, it generates a list of random numbers whose length equals that value. Once this list is obtained, each number in it is counted into a group according to a ranking rule.
ranking rule
for each generated number, if it:
- is less than 1 => increment group A
- is between 1 and 3 => increment group B
- is between 3 and 4 => increment group C
- is greater than 4 => increment group D
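As a small illustration of the rule, here is a sketch that classifies a few fixed values with base R's findInterval. The sample values are hypothetical, and the handling of numbers exactly equal to 1, 3, or 4 is an assumption (findInterval places a boundary value in the upper group), since the rule does not say which side they belong to.

```r
# Hypothetical fixed values, chosen only to illustrate the rule.
values <- c(0.5, 2, 3.5, 4.5)
breaks <- c(1, 3, 4)

# findInterval() returns 0..3 (number of breaks <= value);
# add 1 to turn that into an index into the group labels.
groups <- LETTERS[1:4][findInterval(values, breaks) + 1]
print(groups)
# [1] "A" "B" "C" "D"
```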
initial data
"category_name" "nb_student"
A 6,00000
A 10,00000
B 12,0000
C 74,0000
D 6,00000
init data code
library(data.table)

DT = data.table(
  category_name = c("A","B","C","D"),
  nb_student = c(6,12,74,6)
)
function for each row
treatment_group <- function(nb_student){
  # one counter per group
  nb_group_A <- nb_group_B <- nb_group_C <- nb_group_D <- 0
  limit_1 <- 1
  limit_2 <- 3
  limit_3 <- 4
  # one random value in [0, 5] per student
  values <- runif(nb_student, 0, 5)
  for (i in values) {
    if (i < limit_1) {
      nb_group_A <- nb_group_A + 1
    } else if (i < limit_2) {
      nb_group_B <- nb_group_B + 1
    } else if (i < limit_3) {
      # between 3 and 4 => group C, as stated in the ranking rule
      nb_group_C <- nb_group_C + 1
    } else {
      nb_group_D <- nb_group_D + 1
    }
  }
  list(nb_group_A, nb_group_B, nb_group_C, nb_group_D)
}
result
DT[ , c("group A", "group B", "group C", "group D") := treatment_group(nb_student), by = seq_len(nrow(DT))]
The final result must have the same layout as this table (the counts themselves come from random draws, so they vary from run to run)
"category_name" "nb_student" "group A" "group B" "group C" "group D"
A 6,00000 0,00000 2,00000 4,00000 0,00000
A 10,00000 3,00000 3,00000 4,00000 0,00000
B 12,0000 2,00000 9,00000 0,00000 1,00000
C 74,0000 14,0000 29,0000 15,0000 16,0000
D 6,00000 0,00000 1,00000 3,00000 2,00000
This code works, but I want to optimize it to run on 200,000 rows. Maybe using parallelization?
Answer:
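I guess you can try findInterval, which bins every generated value against the break points 1, 3 and 4 in a single vectorized call instead of looping over them: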
set.seed(1)
DT[
  ,
  c(
    .SD,
    as.data.frame(
      t(as.matrix(table(
        factor(
          # findInterval() returns 0..3, so shift to 1..4 to match the levels
          findInterval(runif(nb_student, 0, 5), c(1, 3, 4)) + 1,
          levels = 1:4,
          labels = paste("group", LETTERS[1:4])
        )
      )))
    )
  ),
  by = category_name
]
which gives
category_name nb_student group A group B group C group D
1: A 6 0 4 0 2
2: B 12 2 3 5 2
3: C 74 11 35 17 11
4: D 6 0 2 3 1
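A possible variant, sketched under assumptions rather than taken from the answer: the same findInterval binning, but counted with base R's tabulate and grouped by row index, as in the question, so that rows sharing a category_name are still handled separately. The helper name count_groups and the variable group_cols are hypothetical, and whether this is faster on 200,000 rows would need to be benchmarked.

```r
library(data.table)

# Hypothetical helper (not from the original answer): draw n values in
# [0, 5], bin them with findInterval(), and count each bin with tabulate().
count_groups <- function(n, breaks = c(1, 3, 4)) {
  bins <- findInterval(runif(n, 0, 5), breaks) + 1L
  # nbins guarantees one count per group, including empty groups
  as.list(tabulate(bins, nbins = length(breaks) + 1L))
}

set.seed(1)
DT <- data.table(
  category_name = c("A", "B", "C", "D"),
  nb_student = c(6, 12, 74, 6)
)

group_cols <- paste("group", LETTERS[1:4])
# Group by row index so every row gets its own set of draws.
DT[, (group_cols) := count_groups(nb_student), by = seq_len(nrow(DT))]
print(DT)
```

Because every draw lands in exactly one group, the four group columns of each row should sum to that row's nb_student, which gives a quick sanity check on the result.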