We need to optimize a program that processes each row of a table: depending on the value of "nb_student", it generates a list of that many random numbers. Once this list is obtained, each number in it is counted into a group according to a ranking rule.

ranking rule

for each generated number, if the number:

is less than 1 => increment group A
is between 1 and 3 => increment group B
is between 3 and 4 => increment group C
is greater than 4 => increment group D
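
As a small illustration of the rule above, here is a sketch in base R on a few hand-picked values. The names `breaks`, `values`, `idx` and `groups` are made up for the example, and the rule does not say where a value exactly equal to 1, 3 or 4 should go; this sketch assumes it goes to the upper group, which is what `findInterval` does by default.

```r
# Thresholds of the ranking rule (1, 3 and 4)
breaks <- c(1, 3, 4)

# Hand-picked sample values, including the boundaries themselves
values <- c(0.5, 1, 2.9, 3, 3.5, 4, 4.7)

# findInterval() returns 0 for values below the first break,
# 1 for values in [1, 3), 2 for [3, 4) and 3 for values >= 4,
# so adding 1 gives a group index from 1 to 4
idx <- findInterval(values, breaks) + 1L

groups <- LETTERS[idx]
print(groups)  # "A" "B" "B" "C" "C" "D" "D"

# Count per group; the factor keeps empty groups at 0
counts <- table(factor(groups, levels = LETTERS[1:4]))
print(counts)
```

With random values drawn by `runif`, hitting a boundary exactly is practically impossible, so the choice made for the boundaries should not change the counts in practice.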

initial data

  "category_name" "nb_student"    
    A                   6,00000
    A                   10,00000            
    B                   12,0000         
    C                   74,0000     
    D                   6,00000

init data code

library(data.table)

DT <- data.table(
  category_name = c("A", "B", "C", "D"),
  nb_student = c(6, 12, 74, 6)
)

function for each row

treatment_group <- function(nb_student){
    nb_group_A <- nb_group_B <- nb_group_C <- nb_group_D <- 0

    limit_1 <- 1
    limit_2 <- 3
    limit_3 <- 4

    # one random value between 0 and 5 per student
    values <- runif(nb_student, 0, 5)

    # apply the ranking rule to each value
    for (i in values) {
        if (i < limit_1) {
            nb_group_A <- nb_group_A + 1
        } else if (i < limit_2) {
            nb_group_B <- nb_group_B + 1
        } else if (i < limit_3) {
            nb_group_C <- nb_group_C + 1
        } else {
            nb_group_D <- nb_group_D + 1
        }
    }

    list(nb_group_A, nb_group_B, nb_group_C, nb_group_D)
}

result

DT[ , c("group A", "group B", "group C", "group D") := treatment_group(nb_student), by = seq_len(nrow(DT))]

The final result must have the same form as this table (the counts themselves vary from run to run because the values are random)

"category_name" "nb_student"           "group A"       "group B"       "group C"     "group D"
       A             6,00000            0,00000         2,00000         4,00000       0,00000
       A             10,00000           3,00000         3,00000         4,00000       0,00000
       B             12,0000            2,00000         9,00000         0,00000       1,00000
       C             74,0000            14,0000         29,0000         15,0000       16,0000
       D             6,00000            0,00000         1,00000         3,00000       2,00000

This code works, but I want to optimize it to run on 200,000 rows. Maybe using parallelization?

CodePudding user response：

I guess you can try findInterval, which replaces the loop with a single vectorized lookup of the interval each value falls into

set.seed(1)
DT[
  ,
  c(
    .SD,
    as.data.frame(
      t(as.matrix(table(
        factor(
          findInterval(runif(nb_student, 0, 5), c(1, 3, 4)) + 1,
          levels = 1:4,
          labels = paste("group", LETTERS[1:4])
        )
      )))
    )
  ),
  category_name
]

which gives

   category_name nb_student group A group B group C group D
1:             A          6       0       4       0       2
2:             B         12       2       3       5       2
3:             C         74      11      35      17      11
4:             D          6       0       2       3       1
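
For 200,000 rows, the per-row grouping may itself become the bottleneck. One possible alternative, shown here only as a sketch, is to draw all the random values in a single `runif` call and count them per row and group in one pass. The helper `count_groups` and the names `row_id`, `grp` and `counts` are hypothetical and not part of the code above; the sketch uses base R only and works on the `nb_student` vector.

```r
# Hypothetical helper: one row of counts per entry of nb_student
count_groups <- function(nb_student, breaks = c(1, 3, 4)) {
  n <- length(nb_student)        # number of rows
  k <- length(breaks) + 1L       # number of groups (4 here)

  # row index repeated once per student of that row
  row_id <- rep(seq_len(n), times = nb_student)

  # one random value per student, mapped to a group index 1..k
  grp <- findInterval(runif(length(row_id), 0, 5), breaks) + 1L

  # column-major cell index (row, group), counted in one pass
  counts <- matrix(
    tabulate((grp - 1L) * n + row_id, nbins = n * k),
    nrow = n, ncol = k
  )
  colnames(counts) <- paste("group", LETTERS[seq_len(k)])
  counts
}

set.seed(1)
nb_student <- c(6, 12, 74, 6)
counts <- count_groups(nb_student)
print(counts)
```

Each row of `counts` sums to the matching `nb_student`. The columns could then be attached to `DT`, for example with `DT[, (colnames(counts)) := as.data.frame(counts)]`. Because the values are drawn in a different order than in the grouped version, the same seed will not give the same counts.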