Automate group generation according to intervals in string

I have data on covariates for several units. Additionally, I have access to a scoring rule that ranks my observations according to a score.

I divided my training sample X according to the quantiles of score, using the quantile_group function from the GenericML package.

## Generate data.
set.seed(1986)

n <- 1000
n_val <- 10000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
X_val <- matrix(rnorm(n_val * k), ncol = k)

score <- rexp(n) 
score_val <- rexp(n_val)   

## Quantiles of score.
library(GenericML)
groups <- quantile_group(score)
head(groups)
#>      [-Inf, 0.277) [0.277, 0.678) [0.678, 1.34) [1.34, Inf]
#> [1,]          TRUE          FALSE         FALSE       FALSE
#> [2,]         FALSE          FALSE         FALSE        TRUE
#> [3,]         FALSE          FALSE          TRUE       FALSE
#> [4,]         FALSE           TRUE         FALSE       FALSE
#> [5,]         FALSE           TRUE         FALSE       FALSE
#> [6,]         FALSE          FALSE          TRUE       FALSE

The g-th column of groups consists of TRUEs and FALSEs denoting membership in the g-th quantile group of score. My next step is to divide the units in the validation sample X_val using the same partition as groups. To clarify, I want to divide score_val into four groups defined by the intervals given by colnames(groups):

colnames(groups)
#> [1] "[-Inf, 0.277)"  "[0.277, 0.678)" "[0.678, 1.34)"  "[1.34, Inf]"

I need to automate this.

CodePudding user response:

I think this is one approach to get what you are looking for. I don't use the GenericML package because, if I understood correctly, you only want to divide X_val into subsets.

# Load library
  library(dplyr)

# Generate data
  set.seed(1986)
  n <- 1000
  n_val <- 10000
  k <- 3

  X <- matrix(rnorm(n * k), ncol = k)
# Here I use as.data.frame.matrix() so that the group (according to the interval) can be added as a column
  X_val <- as.data.frame.matrix(matrix(rnorm(n_val * k), ncol = k))

  score <- rexp(n) 
  score_val <- rexp(n_val)   

# Get the quantiles of score  
  q.score <- quantile(score)
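# Divide score_val according to the quantiles stored in q.score.
# right = FALSE gives left-closed intervals [a, b), matching colnames(groups)
  group.var <- cut(score_val, breaks = c(-Inf, q.score[2:4], Inf), right = FALSE)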
  
# Add "group.var" to the X_val data frame
  X_val$group.var <- group.var

# Divide the information according to "group.var" 
  new_X_val <- X_val %>%
               group_split(group.var)

At the end, you get new_X_val, a list with 4 elements, one for each quantile group of score.
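
If only the interval labels are available (for example, colnames(groups) without the original score), the breaks can also be recovered from the strings themselves. The following is a minimal sketch under that assumption, not part of GenericML: breaks_from_labels and groups_val are hypothetical names introduced here, and the labels are copied from the output shown in the question. Because the labels are rounded (0.277 rather than the exact quantile), observations very close to a boundary may land in a different group than they would with the exact quantiles of score, so computing the breaks from score as above is preferable when score is at hand.

```r
# Hypothetical helper (not part of GenericML): recover numeric break points
# from interval labels such as "[0.277, 0.678)".
breaks_from_labels <- function(labels) {
  # Drop the brackets, then split each label into its two end points.
  bare  <- gsub("\\[|\\]|\\(|\\)", "", labels)
  ends  <- strsplit(bare, ",")
  lower <- as.numeric(trimws(vapply(ends, `[`, character(1), 1)))
  upper <- as.numeric(trimws(vapply(ends, `[`, character(1), 2)))
  # Lower end of every interval plus the upper end of the last one.
  c(lower, upper[length(upper)])
}

# Labels as printed by colnames(groups) in the question.
labels <- c("[-Inf, 0.277)", "[0.277, 0.678)", "[0.678, 1.34)", "[1.34, Inf]")
breaks <- breaks_from_labels(labels)

# Illustrative validation scores, drawn from the same distribution as in the question.
set.seed(1986)
score_val <- rexp(10000)

# findInterval() uses left-closed intervals [a, b), matching the labels;
# rightmost.closed = TRUE keeps a value equal to the last break in the last group.
idx <- findInterval(score_val, breaks, rightmost.closed = TRUE)

# Logical membership matrix with the same layout as `groups`.
groups_val <- outer(idx, seq_along(labels), "==")
colnames(groups_val) <- labels
```

Each row of groups_val contains exactly one TRUE, so X_val[groups_val[, g], ] selects the validation units that fall in the g-th interval.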