I have data on covariates for several units. Additionally, I have access to a scoring rule that ranks my observations according to a score.
I decided to divide my training sample X
according to the quantiles of score
, which I achieved by using the quantile_group
function from the GenericMl
package.
## Generate data.
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
X_val <- matrix(rnorm(n_val * k), ncol = k)
score <- rexp(n)
score_val <- rexp(n_val)
## Quantiles of score.
library(GenericML)
groups <- quantile_group(score)
head(groups)
#> [-Inf, 0.277) [0.277, 0.678) [0.678, 1.34) [1.34, Inf]
#> [1,] TRUE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE TRUE
#> [3,] FALSE FALSE TRUE FALSE
#> [4,] FALSE TRUE FALSE FALSE
#> [5,] FALSE TRUE FALSE FALSE
#> [6,] FALSE FALSE TRUE FALSE
The g-th column of groups
consists of TRUE
s and FALSE
s denoting membership to the g-th quantile of score
. My next step is to divide units in the validation sample X_val
using the same partition of groups
. To clarify, I want to divide score_val
in four groups defined by the intervals given by colnames(groups)
:
colnames(groups)
#> [1] "[-Inf, 0.277)" "[0.277, 0.678)" "[0.678, 1.34)" "[1.34, Inf]"
I need to automate this.
CodePudding user response:
I think this can be an approach to get what you are looking for. I don't use the GenericML
package because If I understood well, you only want to divide X_val
into sub-sets.
# Load library
library(dplyr)
# Generate data
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
# Here I use "as.data.frame.matrx" in order to add the group (according to the interval)
X_val <- as.data.frame.matrix(matrix(rnorm(n_val * k), ncol = k))
score <- rexp(n)
score_val <- rexp(n_val)
# Get the quantiles of score
q.score <- quantile(score)
# Divide score_val acording to the quantiles of q.score
group.var <- cut(score_val, breaks = c(-Inf, q.score[2:4], Inf))
# Add "group.var" to X_val matrix
X_val$group.var <- group.var
# Divide the information according to "group.var"
new_X_val <- X_val %>%
group_split(group.var)
At the end, what you get is new_X_val
, a list with 4 elements, one for each quantile.