Undersampling by groups in R to address class and features imbalance issues in hierarchical data-CodePudding

Let's say I have hierarchical data with heavily imbalanced observations between my target variable and a categorical predictor of interest:

# Load appropriate package
library(tidyverse)

# Set seed for reproducibility 
set.seed(999)

# Create a nested and unbalanced data set 
df <- data.frame(
  Season = sample(as.factor(c("Dry", "Rainy")), replace = TRUE), times=c(90, 10),
  Location = rep(as.factor(c('A', 'B','C', 'D')), times=c(60, 10, 20, 10)), 
  Temperatures = round(rnorm(10000), digits = 2)*10,
  Pressure = round(rnorm(10000), digits = 2)*1000,
  Sunshine = round(rnorm(10000), digits = 2))

# Get number of observations per variables of interest: Season (target variable) and Location (categorical variable of interest); data are unbalanced between the two levels of my target variable, and the 4 levels of my categorical predictor of interest
df %>% 
  group_by(Season, Location) %>% 
  tally()

I am working with XGBoost algorithm to predict Season based on different numeric variables listed above and in respect of the Location, after which I am interpreting model output using SHapley Additive exPlanations. Any attempt to undersample my data set resulted in balancing the target variable (i.e., Season), while I still have important bias towards my nested categorical variable of interest here, the Location. As ranges and variations in temperature, pressure and sunshine are highly dependent on the Location, an uneven number of observations between one or the other location will ultimately result in model bias. Therefore, to take into account effects due to Location, and differences in magnitude, as a conservative approach, I have first standardized my numeric variables in respect of my categorical variable of interest:

# Extract observations for Location A 
df_A <- df %>%
  dplyr::filter(Location == "A") %>%
  dplyr::select_if(is.factor) 

# Standardize observations for Location A 
df_A_scaled <- df %>%
  dplyr::filter(Location == "A") %>%
  ungroup() %>%
  dplyr::select_if(is.numeric) %>%
  scale(center = TRUE, scale = TRUE) %>%
  bind_cols(df_A) %>%
  relocate(where(is.factor))

# Extract observations for Location B 
df_B <- df %>%
  dplyr::filter(Location == "B") %>%
  dplyr::select_if(is.factor) 

# Standardize observations for Location B
df_B_scaled <- df %>%
  dplyr::filter(Location == "B") %>%
  ungroup() %>%
  dplyr::select_if(is.numeric) %>%
  scale(center = TRUE, scale = TRUE) %>%
  bind_cols(df_B) %>%
  relocate(where(is.factor))

# Extract observations for Location C 
df_C <- df %>%
  dplyr::filter(Location == "C") %>%
  dplyr::select_if(is.factor) 

# Standardize observations for Location C
df_C_scaled <- df %>%
  dplyr::filter(Location == "C") %>%
  ungroup() %>%
  dplyr::select_if(is.numeric) %>%
  scale(center = TRUE, scale = TRUE) %>%
  bind_cols(df_C) %>%
  relocate(where(is.factor))

# Extract observations for Location D
df_D <- df %>%
  dplyr::filter(Location == "D") %>%
  dplyr::select_if(is.factor) 

# Standardize observations for Location D
df_D_scaled <- df %>%
  dplyr::filter(Location == "D") %>%
  ungroup() %>%
  dplyr::select_if(is.numeric) %>%
  scale(center = TRUE, scale = TRUE) %>%
  bind_cols(df_D) %>%
  relocate(where(is.factor))

# Bind rows of the different data sets we previously created
df <- bind_rows(df_A_scaled, df_B_scaled, df_C_scaled, df_D_scaled)

# Print the results
print(df)

Then, I manually performed undersampling while randomly deleting examples from the majority class (i.e., the Season) in respect of my categorical variables of interest data are nested within: the location. However, I assume using the following method will be very tedious to be run multiple times, so I was wondering if any other alternative could be foreseen.

# Extract the minimum number of observations per Seasons grouped by Locations
minimum_obs_per_class <- df %>% 
  group_by(Location, Season) %>% 
  tally() %>%
  ungroup() %>%
  select(n) %>%
  summarise(min(n)) %>%
  as.numeric()

# Perform manual undersampling to rebalance the data
df <- df %>% 
    group_by(Location, Season) %>%
    slice_sample(n = minimum_obs_per_class, replace = F)

# Check the results
df %>% 
  group_by(Location, Season) %>%
  tally() %>%
  arrange(n)

CodePudding user response：

Welcome to SO!
You can easily do all your steps with the recipes packages themis for downsampling, plus they are part of a great framework for ML, tidymodels.

library(recipes)
library(themis)
library(dplyr)

# starting from your raw data
downsampled <- df %>% 
  # add a unique dependent
  mutate(dependent = paste0(Season, Location)) %>% 
  # define the recipe
  recipe(dependent   ~ ., data = df1) %>%
  # scale the numeric variables
  step_center(all_numeric_predictors()) %>%
  # normalize the numeric variables
  step_normalize(all_numeric_predictors()) %>%
  # undersampling
  step_downsample(dependent) %>%
  # prep the recipe
  prep() %>%
  # apply it standalone to your data
  bake(new_data = NULL) %>% 
  # remove the useless column
  select(-dependent)

Now some checks:

  table(paste0(downsampled$Season,downsampled$Location))

  DryA   DryB   DryC   DryD RainyA RainyB RainyC RainyD 
   500    500    500    500    500    500    500    500 

head(downsampled)
# A tibble: 6 x 6
  Season times Location Temperatures Pressure Sunshine
  <fct>  <dbl> <fct>           <dbl>    <dbl>    <dbl>
1 Dry     1.00 A              -0.159   0.0433    0.363
2 Dry     1.00 A              -0.520   1.12     -1.39 
3 Dry     1.00 A               0.533   1.31     -1.13 
4 Dry     1.00 A               0.523  -0.322     1.76 
5 Dry     1.00 A               1.33    0.399    -0.420
6 Dry     1.00 A              -0.981  -0.550    -0.440

Note you can use tidymodels end to end, it seems you're going to use xgboost, that is flawlessly integrated in the workflow.