How can I use a pipeline graph with upsampling when my task is ordered?


I have a task where the observations (rows) have a date order. I generate a custom resampling scheme that respects this order in all train/test splits.

I also want to address the class imbalance problem by upsampling the minority class. Within the training sets, the time order is not important (and the learner would not use it anyway).

Now I want to resample this combination of an ordered task, a graph learner (including upsampling), and a time-sensitive custom resampling scheme. But this is problematic.

To show this, I generated the following code. I use a sample task to make it reproducible and augment it with a date column so that I get an ordered task similar to my real one. The code only runs if I omit the problematic lines indicated below, but those lines generate exactly what I have in my real-world problem: an order. So how can I solve this?

(I omit some of the output in the following reprex for readability.)

library(mlr3verse)
#> Warning: package 'mlr3verse' was built under R version 4.1.1
#> Loading required package: mlr3

library(tidyverse)

library(lubridate)


# load sample task 

task <- tsk("breast_cancer")


#### start of lines that generate a problem

# add a date column to produce an artificial sample problem with time order of rows specified by a date column
DateColumn <- seq(ymd('2000-04-07'),ymd('2021-03-22'), by = '1 week')
DateColumn <- DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow])) # add date column
task$set_col_roles("Date", roles = "order")

#### end of lines that generate a problem


# Generate a "loo" growing window type resampling scheme, where learner is trained on "earlier" and tested on "later" data  (hopefully - or may it be that the original row order is not preserved?)
# first training window size is 10 weeks

length_first_window <- 10

resampling_grow_win = rsmp("custom")

train_sets = list(1:length_first_window)    
test_sets = list(length_first_window + 1)

for (testweek in ((length_first_window + 2):task$nrow)) {
  
  train_sets <- append(train_sets, list(c(1:(testweek-1))))
  test_sets <- append(test_sets, list(c(testweek)))
  
}

resampling_grow_win$instantiate(task, train_sets, test_sets)
resampling_grow_win$id <- paste0("gw_for", task$id)



# now, I define a pipeline for a learner with preceding upsampling:

# oversample or undersample such that the number of cases is equal
# I assume here that the actual number is not really important and 
# use approximately the size of the total sample (which will be much larger than the sample size
# for the first growing window resamplings, but I am oversampling anyway)


NrSamplesEach <- 500

po_classbalance = po("classbalancing",
  id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = NrSamplesEach, reference="one")


#create a learner
learner = lrn("classif.ranger", num.trees = 10)


# combine learner with pipeline graph
learner_balanced = as_learner(po_classbalance %>>% learner)

# run the resampling
rr = resample(task, learner_balanced, resampling_grow_win, store_models = TRUE) 
#> INFO  [16:47:10.484] [mlr3] Applying learner 'sample2equal.classif.ranger' on task 'breast_cancer' (iter 111/673)
#> Error: Cannot rbind data to task 'breast_cancer', missing the following mandatory columns: Date
#> This happened PipeOp sample2equal's $train()

# show some results (I am aware that there is only one element in the test set in each iteration, but this is ok for this example)

scored_result <- rr$score(msr("classif.acc"))
#> Error in eval(expr, envir, enclos): object 'rr' not found
head(scored_result)
#> Error in head(scored_result): object 'scored_result' not found

Created on 2021-12-09 by the reprex package (v2.0.1)
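A quick check of what the pipeop actually sees (a sketch, assuming the task built above, and only my reading of the error): once "Date" carries only the "order" role, it is no longer part of the data returned by task$data(), which would explain the missing-column message.

# sketch: inspect the column roles and the data the pipeop operates on
task$col_roles$order                # should list "Date"
"Date" %in% names(task$data())      # expected FALSE: the order column is not among the features/target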

CodePudding user response:

You could upsample the data set first and then create the custom resampling splits.

library(mlr3)
library(mlr3misc)
library(lubridate)

task = tsk("breast_cancer")

# set date column
DateColumn = seq(ymd('2000-04-07'),ymd('2021-03-22'), by = '1 week')
DateColumn = DateColumn[1:task$nrow]
task$cbind(data.frame(Date = DateColumn[1:task$nrow]))

# upsample task
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = NrSamplesEach, reference="one")
task = po_classbalance$train(list(task))[[1]]

# add helper column to indicate position in unordered data table
task$cbind(data.frame(i = 1:task$nrow))

# set order
task$set_col_roles("Date", roles = "order")

# custom resampling
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))

# map to position of unordered data table
data_ordered = task$data(order = TRUE)
train_sets = map(train_sets, function(x) data_ordered$i[x])
test_sets = map(test_sets, function(x) data_ordered$i[x])

# remove helper column
task$select(setdiff(task$feature_names, "i"))

learner = lrn("classif.rpart")
rr = resample(task, learner, resampling_grow_win, store_models = TRUE)
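Scoring then works as in the question (a short sketch, assuming the rr object from the snippet above):

# per-iteration accuracy and the aggregate over all growing-window splits
scored_result = rr$score(msr("classif.acc"))
head(scored_result)
rr$aggregate(msr("classif.acc"))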

CodePudding user response:

Yep, sorry, I made a mistake. We need to fix the pipeop: as far as I can tell, it rbinds the upsampled rows back without the "Date" column once that column only has the "order" role, which is exactly the error above. However, you can just order the data first and skip the task$set_col_roles("Date", roles = "order") part. Just to be safe, check with task$data(rows = ...) that your data is returned in chronological order, e.g. that task$data(rows = 1) returns the first time point.

library(mlr3)
library(mlr3pipelines)
library(data.table)
library(mlr3misc)

task = tsk("breast_cancer")
learner = lrn("classif.rpart")
resampling = rsmp("holdout")

# extract data
data = task$data()

# fake date column
date = sample(seq(task$nrow))
data[, date := date]

# order data in chronological order
setorder(data, date)

# remove date column
data[, date := NULL]

# create task with ordered data
task = as_task_classif(data, target = "class")

# set custom resampling
train_sets = map(seq(10, task$nrow - 1), seq)
test_sets = as.list(seq(11, task$nrow))
resampling = rsmp("custom")
resampling$instantiate(task, train_sets, test_sets)

# learner with upsampling
po_classbalance = po("classbalancing", id = "sample2equal", adjust = "all", shuffle = FALSE, ratio = 500, reference="one")
learner_balanced = as_learner(po_classbalance %>>% learner)

# resample
rr = resample(task, learner_balanced, resampling, store_models = TRUE)
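And the check mentioned above, plus scoring, as a sketch reusing the objects from this snippet: with the rows already in chronological order, every training window should end strictly before its test row.

# sanity check: each training set only contains rows earlier than its test row
all(map_lgl(seq_along(test_sets), function(i) max(train_sets[[i]]) < min(test_sets[[i]])))

# score as in the question
rr$score(msr("classif.acc"))
rr$aggregate(msr("classif.acc"))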