6  Preprocess

Data preprocessing (checking, cleaning, imputing, and transforming raw data before modeling) is an essential step in most data pipelines.

Let’s start with the Sonar dataset and add some missing values for this example.

data(Sonar, package = "mlbench")
Sonar[c(10, 20, 30, 40, 50), 1] <- NA
Sonar[c(15, 25, 35, 45, 55), 2] <- NA
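
As a quick sanity check, the injected missing values can be counted with base R before moving on. This is a minimal sketch that assumes only that the mlbench package is installed; `na_counts` is a name introduced here for illustration.

```r
# Recreate the example data with injected missing values
data(Sonar, package = "mlbench")
Sonar[c(10, 20, 30, 40, 50), 1] <- NA
Sonar[c(15, 25, 35, 45, 55), 2] <- NA

# Count NA values per column and keep only the affected columns
na_counts <- colSums(is.na(Sonar))
na_counts[na_counts > 0]

# Total number of NA values in the dataset
sum(na_counts)
```

Five rows were set to NA in each of the first two columns, so the total should match the 10 'NA' values that check_data() reports in the next section.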

6.1 Check data

To check your data, use the check_data() function:

check_data(Sonar)
  Sonar: A data.table with 208 rows and 61 columns

  Data types
  * 60 numeric features
  * 0 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 2 features include 'NA' values; 10 'NA' values total
    * 2 numeric

  Recommendations
  * Consider imputing missing values or use complete cases only 

The output summarizes the dataset's dimensions, feature types, and potential issues (constant features, duplicate cases, missing values), followed by recommendations.
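
To make the summary above more concrete, the same kinds of counts can be approximated with base R. This is only a rough sketch of the idea, not how check_data() is implemented; all variable names below are introduced here for illustration, and check_data() may define constant features or duplicate cases differently.

```r
# Recreate the example data with injected missing values
data(Sonar, package = "mlbench")
Sonar[c(10, 20, 30, 40, 50), 1] <- NA
Sonar[c(15, 25, 35, 45, 55), 2] <- NA

# Data types: count double-precision and factor columns
n_numeric <- sum(sapply(Sonar, is.double))
n_factor <- sum(sapply(Sonar, is.factor))

# Issues: columns with at most one distinct non-missing value, repeated rows
n_constant <- sum(sapply(Sonar, function(x) length(unique(na.omit(x))) <= 1))
n_duplicate <- sum(duplicated(Sonar))

# Missing values: per-feature counts, number of affected features, total
na_per_feature <- colSums(is.na(Sonar))
n_features_with_na <- sum(na_per_feature > 0)
n_na_total <- sum(na_per_feature)

c(numeric = n_numeric, factor = n_factor, constant = n_constant,
  duplicate = n_duplicate, features_with_na = n_features_with_na,
  na_total = n_na_total)
```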

6.2 Preprocess

To clean or otherwise preprocess the data, use the preprocess() function. In this case, we want to impute the missing values. By default, preprocess() uses the missRanger package, which predicts each missing value from the other variables using random forests in an iterative procedure.

Sonar.pre <- preprocess(Sonar, impute = TRUE)
02-23-24 13:55:22 Hello, egenn [preprocess]
02-23-24 13:55:22 Imputing missing values using predictive mean matching with missRanger... [preprocess]

Missing value imputation by random forests

  Variables to impute:      V1, V2
  Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class

iter 1
  |======================================================================| 100%
iter 2
  |======================================================================| 100%
iter 3
  |======================================================================| 100%
iter 4
  |======================================================================| 100%
02-23-24 13:55:22 Completed in 0.01 minutes (Real: 0.68; User: 1.29; System: 0.08) [preprocess]

Let’s now check our preprocessed data:

check_data(Sonar.pre)
  Sonar.pre: A data.table with 208 rows and 61 columns

  Data types
  * 60 numeric features
  * 0 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good 
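
For readers who want to see the imputation step in isolation, the following sketch calls missRanger directly on the same data. It is not what preprocess() runs internally: the specific settings (`pmm.k = 3`, `num.trees = 100`, `seed = 2024`) are assumptions chosen for illustration, and the names `Sonar_imputed` and `obs_V1` are introduced here. It assumes the missRanger, ranger, and mlbench packages are installed.

```r
library(missRanger)

# Recreate the example data with injected missing values
data(Sonar, package = "mlbench")
Sonar[c(10, 20, 30, 40, 50), 1] <- NA
Sonar[c(15, 25, 35, 45, 55), 2] <- NA

# Impute all variables using all variables (formula `. ~ .`).
# pmm.k > 0 turns on predictive mean matching, so imputed values are
# drawn from observed values of the same variable.
# The settings below are illustrative assumptions, not rtemis defaults.
Sonar_imputed <- missRanger(
  Sonar,
  formula = . ~ .,
  pmm.k = 3,
  num.trees = 100,  # passed on to ranger
  seed = 2024,      # for reproducibility
  verbose = 0       # suppress progress output
)

# No missing values should remain
anyNA(Sonar_imputed)

# Values that were observed to begin with should be left untouched
obs_V1 <- !is.na(Sonar$V1)
identical(Sonar_imputed$V1[obs_V1], Sonar$V1[obs_V1])
```

Only the cells that were missing are filled in; the dimensions and column types of the data stay the same.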

6.2.1 Preprocessing options

The preprocess() function accepts the following arguments. See its documentation for details.

  • completeCases
  • removeCases.thres
  • removeFeatures.thres
  • missingness
  • impute
  • integer2factor
  • integer2numeric
  • logical2factor
  • logical2numeric
  • numeric2factor
  • numeric2factor.levels
  • len2fac
  • character2factor
  • factorNA2missing
  • factorNA2missing.level
  • scale
  • center
  • removeConstants
  • removeDuplicates
  • oneHot
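
To give a feel for what transformations such as center, scale, and oneHot refer to, here is a base R sketch of the underlying operations. It does not call preprocess() and makes no claim about how rtemis implements or applies these options; the names `Sonar_scaled` and `class_onehot` are introduced here for illustration, and the mlbench package is assumed to be installed.

```r
data(Sonar, package = "mlbench")

# Centering and scaling: subtract each numeric column's mean and
# divide by its standard deviation
is_num <- sapply(Sonar, is.numeric)
Sonar_scaled <- Sonar
Sonar_scaled[is_num] <- lapply(Sonar[is_num], function(x) (x - mean(x)) / sd(x))

# After standardization, each numeric column has mean ~0 and sd 1
round(mean(Sonar_scaled$V1), 10)
sd(Sonar_scaled$V1)

# One-hot encoding: one indicator column per factor level
class_onehot <- model.matrix(~ Class - 1, data = Sonar)
head(class_onehot)
```

Each row of `class_onehot` contains a single 1 marking that case's class, with one column per level of the factor.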