14  Boosting

Boosting is one of the most powerful techniques in supervised learning. rtemis allows you to easily apply boosting to any learner for regression using boost().

Let’s create some synthetic data:

set.seed(2018)
x <- rnormmat(500, 50)                  # 500 cases x 50 features of random normals
colnames(x) <- paste0("Feature", 1:50)
w <- rnorm(50)                          # true linear coefficients
y <- x %*% w + rnorm(500)               # linear signal plus Gaussian noise
dat <- data.frame(x, y)
res <- resample(dat, seed = 2018)       # 10 stratified subsamples by default
01-07-24 00:31:47 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10 
      resampler: strat.sub 
   stratify.var: y 
        train.p: 0.75 
   strat.n.bins: 4 
01-07-24 00:31:47 Created 10 stratified subsamples [resample]

dat.train <- dat[res$Subsample_1, ]     # training cases from the first resample
dat.valid <- dat[-res$Subsample_1, ]    # held-out cases for validation

14.1 Boost CART stumps

Boosting works best when training a long series of weak learners. Let's start by boosting the simplest trees, those with maxdepth = 1, a.k.a. stumps.
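
Under the hood, gradient boosting for regression repeatedly fits a base learner to the residuals of the current ensemble and adds a shrunken copy of its predictions. Here is a minimal conceptual sketch using rpart stumps; it illustrates the idea only and is not rtemis's implementation (rpart and the helper name boost_stumps are assumptions):

library(rpart)

# Conceptual gradient boosting with stumps; illustration only
boost_stumps <- function(x, y, max.iter = 50, learning.rate = 0.1) {
  Fitted <- rep(mean(y), length(y))          # initialize at the mean of y
  mods <- vector("list", max.iter)
  for (i in seq_len(max.iter)) {
    res.i <- y - Fitted                      # residuals of the current ensemble
    mods[[i]] <- rpart(res.i ~ ., data = data.frame(x, res.i),
                       control = rpart.control(maxdepth = 1))
    # take a small step toward the new stump's predictions
    Fitted <- Fitted + learning.rate * predict(mods[[i]])
  }
  list(mods = mods, fitted = Fitted)
}

rtemis's boost() wraps this general recipe for any learner, adding the input summaries and the per-iteration training/validation error tracking seen below.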

boost.cart <- boost(dat.train, x.valid = dat.valid,
                    mod = 'cart',
                    maxdepth = 1,
                    max.iter = 50)
01-07-24 00:31:47 Hello, egenn [boost]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: Not available
  Testing outcome: Not available
.:Parameters
               mod: CART 
        mod.params:  
                    maxdepth: 1 
              init: -0.182762669446564 
          max.iter: 50 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
01-07-24 00:31:47 [ Boosting Classification and Regression Trees... ] [boost]
01-07-24 00:31:47 Iteration #5: Training MSE = 49.08; Validation MSE = 52.02 [boost]
01-07-24 00:31:48 Iteration #10: Training MSE = 45.91; Validation MSE = 49.65 [boost]
01-07-24 00:31:48 Iteration #15: Training MSE = 43.30; Validation MSE = 47.54 [boost]
01-07-24 00:31:48 Iteration #20: Training MSE = 40.92; Validation MSE = 45.75 [boost]
01-07-24 00:31:48 Iteration #25: Training MSE = 38.78; Validation MSE = 44.10 [boost]
01-07-24 00:31:48 Iteration #30: Training MSE = 36.85; Validation MSE = 42.97 [boost]
01-07-24 00:31:48 Iteration #35: Training MSE = 35.08; Validation MSE = 41.76 [boost]
01-07-24 00:31:48 Iteration #40: Training MSE = 33.45; Validation MSE = 40.69 [boost]
01-07-24 00:31:48 Iteration #45: Training MSE = 31.93; Validation MSE = 39.59 [boost]
01-07-24 00:31:48 Iteration #50: Training MSE = 30.53; Validation MSE = 38.68 [boost]
01-07-24 00:31:48 Reached max iterations [boost]


.:Regression Training Summary
    MSE = 30.53 (42.94%)
   RMSE = 5.53 (24.46%)
    MAE = 4.36 (23.22%)
      r = 0.82 (p = 1.3e-91)
   R sq = 0.43
01-07-24 00:31:48 Completed in 0.01 minutes (Real: 0.64; User: 0.58; System: 0.06) [boost]

We notice that the validation error is considerably higher than the training error, and also decreases less smoothly.
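
To verify, we can compute the validation error directly. A sketch, assuming the returned object supports the usual predict() method with new data, as rtemis models generally do (the names predicted.valid and mse.valid are ours):

predicted.valid <- predict(boost.cart, dat.valid[, -ncol(dat.valid)])  # features only
mse.valid <- mean((dat.valid$y - predicted.valid)^2)
mse.valid   # should be close to the final Validation MSE logged above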

14.2 Boost CART stumps: step slower

To get better results out of boosting, it usually helps to decrease the learning rate and increase the number of iterations. From an optimization point of view, a lower learning rate does not mean you simply take more, smaller steps instead of fewer, bigger ones; it makes the ensemble follow a different, more precise optimization path.
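
Written as an update rule (generic gradient boosting notation, not rtemis-specific):

  F_0(x) = init
  F_m(x) = F_{m-1}(x) + learning.rate * h_m(x)

where h_m is the base learner fit to the residuals y - F_{m-1}(x). Because the residuals at step m depend on all previous steps, shrinking the learning rate changes which base learners get fit, not merely how much weight each one receives.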

boost.cart <- boost(dat.train, x.valid = dat.valid, mod = 'cart',
                    maxdepth = 1,
                    max.iter = 500, learning.rate = .05,
                    print.progress.every = 100)
01-07-24 00:31:50 Hello, egenn [boost]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: Not available
  Testing outcome: Not available
.:Parameters
               mod: CART 
        mod.params:  
                    maxdepth: 1 
              init: -0.182762669446564 
          max.iter: 500 
     learning.rate: 0.05 
         tolerance: 0 
   tolerance.valid: 1e-05 
01-07-24 00:31:50 [ Boosting Classification and Regression Trees... ] [boost]
01-07-24 00:31:51 Iteration #100: Training MSE = 30.84; Validation MSE = 38.78 [boost]
01-07-24 00:31:53 Iteration #200: Training MSE = 20.96; Validation MSE = 31.99 [boost]
01-07-24 00:31:54 Iteration #300: Training MSE = 15.16; Validation MSE = 27.27 [boost]
01-07-24 00:31:55 Iteration #400: Training MSE = 11.38; Validation MSE = 23.75 [boost]
01-07-24 00:31:56 Iteration #500: Training MSE = 8.78; Validation MSE = 21.26 [boost]
01-07-24 00:31:56 Reached max iterations [boost]


.:Regression Training Summary
    MSE = 8.78 (83.58%)
   RMSE = 2.96 (59.48%)
    MAE = 2.36 (58.35%)
      r = 0.95 (p = 1.1e-195)
   R sq = 0.84
01-07-24 00:31:56 Completed in 0.10 minutes (Real: 6.23; User: 5.71; System: 0.50) [boost]
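
Note that with the lower learning rate and ten times as many iterations, the validation MSE dropped from 38.68 to 21.26.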

14.3 Boost deep CARTs

Let’s see what can go wrong if your base learners are too strong:

boost.cart <- boost(dat.train, x.valid = dat.valid, mod = 'cart',
                    maxdepth = 20,
                    max.iter = 50)
01-07-24 00:32:16 Hello, egenn [boost]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: Not available
  Testing outcome: Not available
.:Parameters
               mod: CART 
        mod.params:  
                    maxdepth: 20 
              init: -0.182762669446564 
          max.iter: 50 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
01-07-24 00:32:16 [ Boosting Classification and Regression Trees... ] [boost]
01-07-24 00:32:16 Iteration #5: Training MSE = 25.97; Validation MSE = 43.99 [boost]
01-07-24 00:32:16 Iteration #10: Training MSE = 12.59; Validation MSE = 36.27 [boost]
01-07-24 00:32:16 Iteration #15: Training MSE = 6.06; Validation MSE = 32.99 [boost]
01-07-24 00:32:16 Iteration #20: Training MSE = 2.96; Validation MSE = 30.99 [boost]
01-07-24 00:32:16 Iteration #25: Training MSE = 1.46; Validation MSE = 30.02 [boost]
01-07-24 00:32:16 Iteration #30: Training MSE = 0.72; Validation MSE = 29.26 [boost]
01-07-24 00:32:16 Iteration #35: Training MSE = 0.35; Validation MSE = 28.53 [boost]
01-07-24 00:32:16 Iteration #40: Training MSE = 0.17; Validation MSE = 28.07 [boost]
01-07-24 00:32:16 Iteration #45: Training MSE = 0.09; Validation MSE = 27.85 [boost]
01-07-24 00:32:16 Iteration #50: Training MSE = 0.04; Validation MSE = 27.62 [boost]
01-07-24 00:32:16 Reached max iterations [boost]


.:Regression Training Summary
    MSE = 0.04 (99.92%)
   RMSE = 0.20 (97.21%)
    MAE = 0.16 (97.10%)
      r = 1.00 (p = 0.00)
   R sq = 1.00
01-07-24 00:32:16 Completed in 0.01 minutes (Real: 0.64; User: 0.58; System: 0.05) [boost]

We notice that the training error quickly approaches zero while the validation error remains high, i.e. the strong base learners overfit the training data.
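
One common mitigation is to keep the base learners weak and rely on the validation set to stop training early. A sketch, reusing boost()'s parameters from the output above; the exact stopping behavior of tolerance.valid is assumed here, not documented in this section:

boost.cart <- boost(dat.train, x.valid = dat.valid,
                    mod = 'cart', maxdepth = 2,
                    max.iter = 200, learning.rate = 0.05,
                    tolerance.valid = 1e-3)  # assumed: stop once validation improvement falls below this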

14.4 Boost any learner

While decision trees are the most common base learners used in boosting, you can boost any algorithm:

14.4.1 Projection Pursuit Regression

boost.ppr <- boost(dat.train, x.valid = dat.valid,
                   mod = 'ppr', max.iter = 10)
01-07-24 00:32:19 Hello, egenn [boost]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: Not available
  Testing outcome: Not available
.:Parameters
               mod: PPR 
        mod.params: (empty list) 
              init: -0.182762669446564 
          max.iter: 10 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
01-07-24 00:32:19 [ Boosting Projection Pursuit Regression... ] [boost]
01-07-24 00:32:19 Iteration #5: Training MSE = 18.78; Validation MSE = 20.45 [boost]
01-07-24 00:32:20 Iteration #10: Training MSE = 6.62; Validation MSE = 7.99 [boost]
01-07-24 00:32:20 Reached max iterations [boost]


.:Regression Training Summary
    MSE = 6.62 (87.62%)
   RMSE = 2.57 (64.82%)
    MAE = 2.00 (64.78%)
      r = 1.00 (p = 0.00)
   R sq = 0.88
01-07-24 00:32:20 Completed in 0.02 minutes (Real: 1.03; User: 1.01; System: 0.01) [boost]

14.4.2 Multivariate Adaptive Regression Splines (MARS)

boost.mars <- boost(dat.train, x.valid = dat.valid,
                    mod = 'mars', max.iter = 30)
01-07-24 00:32:20 Hello, egenn [boost]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: Not available
  Testing outcome: Not available
.:Parameters
               mod: MARS 
        mod.params: (empty list) 
              init: -0.182762669446564 
          max.iter: 30 
     learning.rate: 0.1 
         tolerance: 0 
   tolerance.valid: 1e-05 
01-07-24 00:32:20 [ Boosting Multivariate Adaptive Regression Splines... ] [boost]
01-07-24 00:32:21 Iteration #5: Training MSE = 25.85; Validation MSE = 31.00 [boost]
01-07-24 00:32:21 Iteration #10: Training MSE = 18.45; Validation MSE = 24.42 [boost]
01-07-24 00:32:22 Iteration #15: Training MSE = 13.70; Validation MSE = 19.16 [boost]
01-07-24 00:32:22 Iteration #20: Training MSE = 11.65; Validation MSE = 17.89 [boost]
01-07-24 00:32:23 Iteration #25: Training MSE = 9.82; Validation MSE = 16.40 [boost]
01-07-24 00:32:23 Iteration #30: Training MSE = 8.89; Validation MSE = 16.03 [boost]
01-07-24 00:32:23 Reached max iterations [boost]


.:Regression Training Summary
    MSE = 8.89 (83.39%)
   RMSE = 2.98 (59.25%)
    MAE = 2.35 (58.65%)
      r = 0.95 (p = 1e-192)
   R sq = 0.83
01-07-24 00:32:23 Completed in 0.05 minutes (Real: 3.20; User: 3.05; System: 0.11) [boost]