16  Handling Imbalanced Data

library(rtemis)

In classification problems, it is common for the outcome classes to appear with very different frequencies. This is known as imbalanced data. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Applying a learning algorithm naively, without accounting for this class imbalance, may lead to a model that always predicts the majority class, which automatically results in 95% accuracy.
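
As a toy illustration (hypothetical data, not part of the rtemis workflow), a trivial classifier that always predicts the majority class of a 95/5 outcome reaches 95% accuracy while detecting none of the events:

y <- factor(c(rep(1, 50), rep(0, 950)), levels = c(1, 0))  # 5% events ("1"), 95% non-events ("0")
majority.pred <- factor(rep(0, 1000), levels = c(1, 0))    # always predict the majority class
mean(majority.pred == y)                                   # 0.95 accuracy, yet no event is detected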

To handle imbalanced data, we make considerations during model training and assessment.

16.1 Model Training

There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main approaches:

  • Inverse Frequency Weighting
    We weight each case based on the frequency of its class, such that cases from less frequent classes are up-weighted (a rough sketch of how such weights can be derived follows this list). This is called Inverse Frequency Weighting (IFW) and is enabled by default in rtemis for all classification learning algorithms that support case weights. The logical argument ifw controls whether IFW is used; it is TRUE by default in all learners.

  • Upsampling the minority class
    We randomly resample the minority class (with replacement) to reach the size of the majority class. The effect is not very different from up-weighting with IFW. The logical argument upsample in all rtemis learners that support classification controls whether upsampling of the minority class is performed. (If it is set to TRUE, the ifw argument becomes irrelevant, since the sample is now balanced.)

  • Downsampling the majority class
    Conversely, we randomly subsample the majority class to reach the size of the minority class. The logical argument downsample controls this behavior.
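
As a rough sketch of the idea behind IFW (plain R, not the rtemis internals, whose exact weighting formula may differ), inverse-frequency case weights for a factor outcome y can be derived as follows:

freqs <- table(y)                               # per-class counts of a factor outcome y
class.weights <- max(freqs) / freqs             # rarer classes receive larger weights
case.weights <- as.numeric(class.weights[as.character(y)])  # one weight per case, by its class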

16.2 Classification model performance metrics

During model selection as well as model assessment, it is crucial to use metrics that take class imbalance into account.
The following metrics address the issue in different ways and are reported by the modError function for all classification problems (a short sketch computing them by hand from a confusion matrix follows the list):

  • Balanced Accuracy: the mean per-class Sensitivity \[\frac{1}{K}\sum_{i=1}^{K} Sensitivity_i\] In the binary case, this is equal to the mean of Sensitivity and Specificity.

  • F1: the harmonic mean of Sensitivity (a.k.a. Recall) and Positive Predictive Value (a.k.a. Precision) \[F_1 = 2\cdot\frac{precision \cdot recall}{precision + recall}\]

  • AUROC (Area Under the ROC Curve): the area under the True Positive Rate (Sensitivity) vs. False Positive Rate (1 - Specificity) curve
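
As a minimal sketch (plain R, not rtemis code), these metrics can be computed by hand from a 2 x 2 confusion matrix; the counts below match the test-set confusion matrix of the uncorrected GLM in 16.4.1:

tp <- 728; fn <- 9   # true positives, false negatives (positive class: 1)
fp <- 24;  tn <- 12  # false positives, true negatives
sensitivity <- tp / (tp + fn)                             # a.k.a. Recall; 0.9878
specificity <- tn / (tn + fp)                             # 0.3333
precision   <- tp / (tp + fp)                             # a.k.a. PPV; 0.9681
(sensitivity + specificity) / 2                           # Balanced Accuracy: 0.6606
2 * precision * sensitivity / (precision + sensitivity)   # F1: 0.9778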

16.3 Example dataset

Let’s look at a very imbalanced dataset from the Penn Machine Learning Benchmarks (PMLB) repository:

dat <- read("https://github.com/EpistasisLab/pmlb/raw/master/datasets/hypothyroid/hypothyroid.tsv.gz")
02-23-24 13:55:54  Reading hypothyroid.tsv.gz using data.table... [read]
02-23-24 13:55:55 Read in 3,163 x 26 [read]
02-23-24 13:55:55 Removed 77 duplicate rows. [read]
02-23-24 13:55:55 New dimensions: 3,086 x 26 [read]
02-23-24 13:55:55 Completed in 0.02 minutes (Real: 0.91; User: 0.07; System: 0.01) [read]

dat$target <- factor(dat$target, levels = c(1, 0))
check_data(dat)
  dat: A data.table with 3086 rows and 26 columns

  Data types
  * 0 numeric features
  * 25 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good 

Get the frequency of the target classes:

table(dat$target)

   1    0 
2945  141 

16.3.1 Class Imbalance

We can quantify the class imbalance with the class_imbalance() function, which implements the following formula:

\[I = K\cdot\sum_{i=1}^K (n_i/N - 1/K)^2\]

class_imbalance(dat$target)
[1] 0.8255895
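
As a sanity check, here is a minimal sketch (an illustrative function, not the rtemis internals) implementing the formula above directly:

imbalance <- function(y) {
  k <- nlevels(y)
  p <- as.numeric(table(y)) / length(y)  # per-class proportions n_i / N
  k * sum((p - 1 / k)^2)                 # 0 for a perfectly balanced outcome
}
imbalance(dat$target)                    # ~0.8256, matching class_imbalance() above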

Let’s create some resamples to train and test models:

res <- resample(dat, seed = 2019)
02-23-24 13:55:56 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10 
      resampler: strat.sub 
   stratify.var: y 
        train.p: 0.75 
   strat.n.bins: 4 
02-23-24 13:55:56 Using max n bins possible = 2 [strat.sub]
02-23-24 13:55:56 Created 10 stratified subsamples [resample]

dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
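
As an optional sanity check, we can confirm that the stratified resampling preserved the class proportions in the training subset:

prop.table(table(dat$target[res$Subsample_1]))  # should closely match the full-data proportions (~95% / ~5%)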

16.4 GLM

16.4.1 No imbalance correction

Let’s train a GLM without inverse frequency weighting or upsampling. Since ifw is TRUE by default in all rtemis supervised learning functions that support it, we have to explicitly set it to FALSE:

mod.glm.imb <- s_GLM(dat.train, dat.test,
                     ifw = FALSE)
02-23-24 13:55:56 Hello, egenn [s_GLM]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

02-23-24 13:55:56 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0   
                1  2195  72
                0    13  33

                   Overall  
      Sensitivity  0.9941 
      Specificity  0.3143 
Balanced Accuracy  0.6542 
              PPV  0.9682 
              NPV  0.7174 
               F1  0.9810 
         Accuracy  0.9633 
              AUC  0.9431 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  728  24
                0    9  12

                   Overall  
      Sensitivity  0.9878 
      Specificity  0.3333 
Balanced Accuracy  0.6606 
              PPV  0.9681 
              NPV  0.5714 
               F1  0.9778 
         Accuracy  0.9573 
              AUC  0.9177 

  Positive Class:  1 
02-23-24 13:55:56 Completed in 1.8e-03 minutes (Real: 0.11; User: 0.10; System: 0.01) [s_GLM]

We get almost perfect Sensitivity, but very low Specificity.

16.4.2 IFW

Let’s enable IFW:

mod.glm.ifw <- s_GLM(dat.train, dat.test,
                     ifw = TRUE)
02-23-24 13:55:56 Hello, egenn [s_GLM]

02-23-24 13:55:56 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

02-23-24 13:55:56 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0   
                1  1912   8
                0   296  97

                   Overall  
      Sensitivity  0.8659 
      Specificity  0.9238 
Balanced Accuracy  0.8949 
              PPV  0.9958 
              NPV  0.2468 
               F1  0.9264 
         Accuracy  0.8686 
              AUC  0.9469 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  624   6
                0  113  30

                   Overall  
      Sensitivity  0.8467 
      Specificity  0.8333 
Balanced Accuracy  0.8400 
              PPV  0.9905 
              NPV  0.2098 
               F1  0.9129 
         Accuracy  0.8461 
              AUC  0.9085 

  Positive Class:  1 
02-23-24 13:55:56 Completed in 9.7e-04 minutes (Real: 0.06; User: 0.05; System: 3e-03) [s_GLM]

Sensitivity dropped somewhat, but Specificity improved dramatically, and the two are now very close.

16.4.3 Upsampling

Let’s try upsampling instead of IFW:

mod.glm.ups <- s_GLM(dat.train, dat.test,
                     ifw = FALSE,
                     upsample = TRUE)
02-23-24 13:55:56 Hello, egenn [s_GLM]

02-23-24 13:55:56 Upsampling to create balanced set... [prepare_data]
02-23-24 13:55:56 1 is majority outcome with length = 2208 [prepare_data]

.:Classification Input Summary
Training features: 4416 x 25 
 Training outcome: 4416 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

02-23-24 13:55:56 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0     
                1  1913   124
                0   295  2084

                   Overall  
      Sensitivity  0.8664 
      Specificity  0.9438 
Balanced Accuracy  0.9051 
              PPV  0.9391 
              NPV  0.8760 
               F1  0.9013 
         Accuracy  0.9051 
              AUC  0.9476 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  630   6
                0  107  30

                   Overall  
      Sensitivity  0.8548 
      Specificity  0.8333 
Balanced Accuracy  0.8441 
              PPV  0.9906 
              NPV  0.2190 
               F1  0.9177 
         Accuracy  0.8538 
              AUC  0.9086 

  Positive Class:  1 
02-23-24 13:55:56 Completed in 1.9e-03 minutes (Real: 0.11; User: 0.11; System: 0.01) [s_GLM]

In this example, upsampling the minority class gives results very similar to IFW: Specificity improves substantially at the cost of some Sensitivity.

16.4.4 Downsampling

mod.glm.downs <- s_GLM(dat.train, dat.test,
                       ifw = FALSE,
                       downsample = TRUE)
02-23-24 13:55:56 Hello, egenn [s_GLM]

02-23-24 13:55:56 Downsampling to balance outcome classes... [prepare_data]
02-23-24 13:55:56 0 is the minority outcome with 105 cases [prepare_data]

.:Classification Input Summary
Training features: 210 x 25 
 Training outcome: 210 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

02-23-24 13:55:56 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1   0   
                1  96   6
                0   9  99

                   Overall  
      Sensitivity  0.9143 
      Specificity  0.9429 
Balanced Accuracy  0.9286 
              PPV  0.9412 
              NPV  0.9167 
               F1  0.9275 
         Accuracy  0.9286 
              AUC  0.9640 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  608   3
                0  129  33

                   Overall  
      Sensitivity  0.8250 
      Specificity  0.9167 
Balanced Accuracy  0.8708 
              PPV  0.9951 
              NPV  0.2037 
               F1  0.9021 
         Accuracy  0.8292 
              AUC  0.9129 

  Positive Class:  1 
02-23-24 13:55:56 Completed in 3.7e-04 minutes (Real: 0.02; User: 0.02; System: 2e-03) [s_GLM]

In this case, downsampling gives results similar to upsampling, with slightly higher test Specificity (0.9167) and slightly lower Sensitivity (0.8250).

16.5 Random forest

Some algorithms offer multiple ways to handle imbalanced data. See the tech report by Chen, Liaw, and Breiman, “Using Random Forest to Learn Imbalanced Data”, for techniques to handle imbalanced classes with Random Forest; it describes the “Balanced Random Forest” and “Weighted Random Forest” approaches. The Weighted Random Forest idea is sketched below using ranger directly.
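
As a rough sketch of the “Weighted Random Forest” idea (assuming the ranger package, which s_Ranger() uses as its backend; fit.wrf is just an illustrative name, and this is not how s_Ranger() is implemented), inverse-frequency class weights can be passed directly to ranger::ranger():

library(ranger)
freqs <- table(dat.train$target)
fit.wrf <- ranger(target ~ ., data = dat.train,
                  num.trees = 1000,
                  class.weights = as.numeric(max(freqs) / freqs))  # weights in factor-level order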

16.5.1 No imbalance correction

Again, let’s begin by training a model with no correction for imbalanced data:

mod.rf.imb <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE)
02-23-24 13:55:56 Hello, egenn [s_Ranger]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

02-23-24 13:55:56 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2207    1
                0     1  104

                   Overall  
      Sensitivity  0.9995 
      Specificity  0.9905 
Balanced Accuracy  0.9950 
              PPV  0.9995 
              NPV  0.9905 
               F1  0.9995 
         Accuracy  0.9991 
              AUC  1.0000 

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  732  13
                0    5  23

                   Overall  
      Sensitivity  0.9932 
      Specificity  0.6389 
Balanced Accuracy  0.8161 
              PPV  0.9826 
              NPV  0.8214 
               F1  0.9879 
         Accuracy  0.9767 
              AUC  0.9776 

  Positive Class:  1 
02-23-24 13:55:57 Completed in 0.02 minutes (Real: 0.98; User: 1.65; System: 0.06) [s_Ranger]

16.5.2 IFW: Case weights

Now, with IFW. By default, s_Ranger() uses IFW to define case weights (i.e. ifw.case.weights = TRUE):

mod.rf.ifw <- s_Ranger(dat.train, dat.test,
                       ifw = TRUE)
02-23-24 13:55:57 Hello, egenn [s_Ranger]

02-23-24 13:55:57 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

02-23-24 13:55:57 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2193    0
                0    15  105

                   Overall  
      Sensitivity  0.9932 
      Specificity  1.0000 
Balanced Accuracy  0.9966 
              PPV  1.0000 
              NPV  0.8750 
               F1  0.9966 
         Accuracy  0.9935 
              AUC  1.0000 

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  728   9
                0    9  27

                   Overall  
      Sensitivity  0.9878 
      Specificity  0.7500 
Balanced Accuracy  0.8689 
              PPV  0.9878 
              NPV  0.7500 
               F1  0.9878 
         Accuracy  0.9767 
              AUC  0.9817 

  Positive Class:  1 
02-23-24 13:55:58 Completed in 0.02 minutes (Real: 0.94; User: 1.74; System: 0.06) [s_Ranger]

Again, IFW improves Specificity (0.7500 vs. 0.6389 without correction), at a small cost in Sensitivity.

16.5.3 IFW: Class weights

Alternatively, we can use IFW to define class weights:

mod.rf.cw <- s_Ranger(dat.train, dat.test,
                      ifw = TRUE,
                      ifw.case.weights = FALSE,
                      ifw.class.weights = TRUE)
02-23-24 13:55:58 Hello, egenn [s_Ranger]

02-23-24 13:55:58 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

02-23-24 13:55:58 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2208    1
                0     0  104

                   Overall  
      Sensitivity  1.0000 
      Specificity  0.9905 
Balanced Accuracy  0.9952 
              PPV  0.9995 
              NPV  1.0000 
               F1  0.9998 
         Accuracy  0.9996 
              AUC  1.0000 

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  732  13
                0    5  23

                   Overall  
      Sensitivity  0.9932 
      Specificity  0.6389 
Balanced Accuracy  0.8161 
              PPV  0.9826 
              NPV  0.8214 
               F1  0.9879 
         Accuracy  0.9767 
              AUC  0.9800 

  Positive Class:  1 
02-23-24 13:55:59 Completed in 0.02 minutes (Real: 0.92; User: 1.59; System: 0.05) [s_Ranger]

16.5.4 Upsampling

Now try upsampling:

mod.rf.ups <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE,
                       upsample = TRUE)
02-23-24 13:55:59 Hello, egenn [s_Ranger]

02-23-24 13:55:59 Upsampling to create balanced set... [prepare_data]
02-23-24 13:55:59 1 is majority outcome with length = 2208 [prepare_data]

.:Classification Input Summary
Training features: 4416 x 25 
 Training outcome: 4416 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

02-23-24 13:55:59 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0     
                1  2206     0
                0     2  2208

                   Overall  
      Sensitivity  0.9991 
      Specificity  1.0000 
Balanced Accuracy  0.9995 
              PPV  1.0000 
              NPV  0.9991 
               F1  0.9995 
         Accuracy  0.9995 
              AUC  1.0000 

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  728  13
                0    9  23

                   Overall  
      Sensitivity  0.9878 
      Specificity  0.6389 
Balanced Accuracy  0.8133 
              PPV  0.9825 
              NPV  0.7188 
               F1  0.9851 
         Accuracy  0.9715 
              AUC  0.9830 

  Positive Class:  1 
02-23-24 13:56:01 Completed in 0.03 minutes (Real: 1.76; User: 3.44; System: 0.08) [s_Ranger]