9  Decomposition / Dimensionality Reduction

Use select_decom() to get a listing of available decomposition algorithms:

select_decom()
.:select_decom
rtemis supports the following decomposition algorithms:

    Name                                   Description
   H2OAE                               H2O Autoencoder
 H2OGLRM                H2O Generalized Low-Rank Model
     ICA                Independent Component Analysis
  Isomap                                        Isomap
    KPCA           Kernel Principal Component Analysis
     LLE                      Locally Linear Embedding
     MDS                      Multidimensional Scaling
     NMF             Non-negative Matrix Factorization
     PCA                  Principal Component Analysis
    SPCA           Sparse Principal Component Analysis
     SVD                  Singular Value Decomposition
    TSNE   t-distributed Stochastic Neighbor Embedding
    UMAP Uniform Manifold Approximation and Projection

We can further divide decomposition algorithms into linear dimensionality reduction (e.g. PCA, ICA, NMF) and nonlinear dimensionality reduction, also called manifold learning (e.g. LLE, t-SNE).

9.0.1 Linear Dimensionality Reduction

As a simple example, let’s look at the famous iris dataset. Note that we use it only to demonstrate usage: with just 4 variables, it is not a good example for assessing the effectiveness of decomposition algorithms.

First, we select all variables from the iris dataset, excluding the group names, i.e. the labels. Since the iris dataset includes one duplicate observation, we remove it using preprocess(). This is required for t-SNE to work.

x <- preprocess(iris[, 1:4], removeDuplicates = TRUE)
01-07-24 00:23:44 Hello, egenn [preprocess]
01-07-24 00:23:44 Removing 1 duplicate case... [preprocess]
01-07-24 00:23:44 Completed in 1.7e-05 minutes (Real: 1e-03; User: 1e-03; System: 0.00) [preprocess]
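
To see which observation is affected, and to build a label vector that matches the de-duplicated data, here is a minimal base-R sketch. It assumes that removeDuplicates behaves like base R's duplicated(), which is consistent with the single removed case reported in the log above. The names features, x_base and species are illustrative and are not rtemis objects.

```r
# Feature columns only (no Species labels)
features <- iris[, 1:4]

# Logical index of rows that repeat an earlier row
is_dup <- duplicated(features)
n_dup <- sum(is_dup)

# De-duplicated features and the labels for the rows that were kept
x_base <- features[!is_dup, ]
species <- iris$Species[!is_dup]

# Identical rows have a pairwise distance of exactly zero,
# which is what neighbor-based methods such as t-SNE tend to reject
min_dist_before <- min(dist(features))
min_dist_after <- min(dist(x_base))
```

Keeping a label vector of the same length as the de-duplicated data avoids a length mismatch when the labels are later used to group points in a plot.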


Now, let’s try a few different algorithms, projecting to two dimensions and visualizing using mplot3_xy(). Notice that we are using the real labels to color the points in these examples:

9.0.1.1 Principal Component Analysis (PCA)

iris.PCA <- d_PCA(x)
01-07-24 00:23:44 Hello, egenn [d_PCA]
01-07-24 00:23:44 ||| Input has dimensions 149 rows by 4 columns, [d_PCA]
01-07-24 00:23:44     interpreted as 149 cases with 4 features. [d_PCA]
01-07-24 00:23:44 Performing Principal Component Analysis... [d_PCA]
01-07-24 00:23:44 Completed in 5e-05 minutes (Real: 3e-03; User: 3e-03; System: 0.00) [d_PCA]

mplot3_xy(iris.PCA$projections.train[, 1], iris.PCA$projections.train[, 2],
          # x has 149 rows after de-duplication, so drop the duplicate's label too
          group = iris$Species[!duplicated(iris[, 1:4])], main = "PCA on iris",
          xlab = "1st PCA component", ylab = "2nd PCA component")
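
To illustrate what "linear" means here, the following base-R sketch uses stats::prcomp() rather than d_PCA(). Whether d_PCA() relies on prcomp() internally is an assumption that is not confirmed here, so treat this as an independent illustration. It shows that the projections are simply the centered data multiplied by a matrix of loadings.

```r
# De-duplicated iris features (149 x 4)
x_base <- unique(iris[, 1:4])

# PCA on centered, unscaled data
pca <- prcomp(x_base, center = TRUE, scale. = FALSE)

# Projections onto the first two principal components
proj <- pca$x[, 1:2]

# The same projections, computed as centered data %*% loadings
centered <- scale(x_base, center = TRUE, scale = FALSE)
manual <- centered %*% pca$rotation[, 1:2]
same_projection <- isTRUE(all.equal(unname(proj), unname(manual)))

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Because the mapping is a single matrix multiplication, new observations can be projected with the same loadings, which is generally not possible with methods such as t-SNE.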

9.0.1.2 Independent Component Analysis (ICA)

iris.ICA <- d_ICA(x, k = 2)
01-07-24 00:23:44 Hello, egenn [d_ICA]
01-07-24 00:23:44 ||| Input has dimensions 149 rows by 4 columns, [d_ICA]
01-07-24 00:23:44     interpreted as 149 cases with 4 features. [d_ICA]
01-07-24 00:23:44 Running Independent Component Analysis... [d_ICA]
01-07-24 00:23:44 Completed in 1.3e-04 minutes (Real: 0.01; User: 2e-03; System: 1e-03) [d_ICA]

mplot3_xy(iris.ICA$projections.train[, 1], iris.ICA$projections.train[, 2],
          # x has 149 rows after de-duplication, so drop the duplicate's label too
          group = iris$Species[!duplicated(iris[, 1:4])], main = "ICA on iris",
          xlab = "1st ICA component", ylab = "2nd ICA component")

9.0.1.3 Non-negative Matrix Factorization (NMF)

iris.NMF <- d_NMF(x, k = 2)
01-07-24 00:23:44 Hello, egenn [d_NMF]
01-07-24 00:23:45 ||| Input has dimensions 149 rows by 4 columns, [d_NMF]
01-07-24 00:23:45     interpreted as 149 cases with 4 features. [d_NMF]
01-07-24 00:23:45 Running Non-negative Matrix Factorization... [d_NMF]
01-07-24 00:23:45 Completed in 0.01 minutes (Real: 0.84; User: 0.78; System: 0.04) [d_NMF]

mplot3_xy(iris.NMF$projections.train[, 1], iris.NMF$projections.train[, 2],
          # x has 149 rows after de-duplication, so drop the duplicate's label too
          group = iris$Species[!duplicated(iris[, 1:4])], main = "NMF on iris",
          xlab = "1st NMF component", ylab = "2nd NMF component")
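
NMF approximates a non-negative data matrix V as the product of two non-negative matrices, W and H. The sketch below is a bare-bones base-R version of the classic multiplicative update rules for the Frobenius-norm objective. It is meant only to show the idea and is not the algorithm d_NMF() necessarily uses. The seed, the number of iterations and the small constant eps are arbitrary choices made for this illustration.

```r
set.seed(2024)

# De-duplicated iris features as a matrix (149 x 4); all entries are non-negative
V <- as.matrix(unique(iris[, 1:4]))
k <- 2

# Random non-negative starting values for the two factors
W <- matrix(runif(nrow(V) * k), nrow(V), k)  # 149 x 2: per-case projections
H <- matrix(runif(k * ncol(V)), k, ncol(V))  # 2 x 4: per-feature basis

# Reconstruction error (Frobenius norm of V - W %*% H)
frob <- function(W, H) sqrt(sum((V - W %*% H)^2))
err_start <- frob(W, H)

# Multiplicative updates keep W and H non-negative at every step
eps <- 1e-9  # guards against division by zero
for (i in seq_len(200)) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
}
err_end <- frob(W, H)
```

The columns of W play the role of the two components plotted above. Because the input must be non-negative, NMF suits measurements such as the iris lengths and widths but not data that have been centered.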

9.0.2 Nonlinear Dimensionality Reduction

9.0.2.1 Isomap

iris.Isomap <- d_Isomap(x, k = 2)
01-07-24 00:23:45 Hello, egenn [d_Isomap]
01-07-24 00:23:46 ||| Input has dimensions 149 rows by 4 columns, [d_Isomap]
01-07-24 00:23:46     interpreted as 149 cases with 4 features. [d_Isomap]
01-07-24 00:23:46 Running Isomap... [d_Isomap]
01-07-24 00:23:46 Completed in 0.01 minutes (Real: 0.49; User: 0.45; System: 0.02) [d_Isomap]

mplot3_xy(iris.Isomap$projections.train[, 1],
          iris.Isomap$projections.train[, 2],
          # x has 149 rows after de-duplication, so drop the duplicate's label too
          group = iris$Species[!duplicated(iris[, 1:4])], main = "Isomap on iris",
          xlab = "1st Isomap projection", ylab = "2nd Isomap projection")

9.0.2.2 t-distributed Stochastic Neighbor Embedding (t-SNE)

iris.tSNE <- d_TSNE(x, k = 2, perplexity = 10)
01-07-24 00:23:46 Hello, egenn [d_TSNE]
01-07-24 00:23:46 Running t-distributed Stochastic Neighbot Embedding [d_TSNE]
01-07-24 00:23:46 ||| Input has dimensions 149 rows by 4 columns, [d_TSNE]
01-07-24 00:23:46     interpreted as 149 cases with 4 features. [d_TSNE]
01-07-24 00:23:46 Running t-SNE... [d_TSNE]
Performing PCA
Read the 149 x 4 data matrix successfully!
Using no_dims = 2, perplexity = 10.000000, and theta = 0.000000
Computing input similarities...
Symmetrizing...
Done in 0.00 seconds!
Learning embedding...
Iteration 50: error is 55.517882 (50 iterations in 0.01 seconds)
Iteration 100: error is 52.572941 (50 iterations in 0.01 seconds)
Iteration 150: error is 53.532609 (50 iterations in 0.01 seconds)
Iteration 200: error is 54.248812 (50 iterations in 0.01 seconds)
Iteration 250: error is 53.744669 (50 iterations in 0.01 seconds)
Iteration 300: error is 1.527791 (50 iterations in 0.01 seconds)
Iteration 350: error is 0.553761 (50 iterations in 0.01 seconds)
Iteration 400: error is 0.337229 (50 iterations in 0.01 seconds)
Iteration 450: error is 0.311688 (50 iterations in 0.01 seconds)
Iteration 500: error is 0.303008 (50 iterations in 0.01 seconds)
Iteration 550: error is 0.297851 (50 iterations in 0.01 seconds)
Iteration 600: error is 0.294886 (50 iterations in 0.01 seconds)
Iteration 650: error is 0.292748 (50 iterations in 0.01 seconds)
Iteration 700: error is 0.291122 (50 iterations in 0.01 seconds)
Iteration 750: error is 0.289852 (50 iterations in 0.01 seconds)
Iteration 800: error is 0.288819 (50 iterations in 0.01 seconds)
Iteration 850: error is 0.287948 (50 iterations in 0.01 seconds)
Iteration 900: error is 0.287181 (50 iterations in 0.01 seconds)
Iteration 950: error is 0.286513 (50 iterations in 0.01 seconds)
Iteration 1000: error is 0.285925 (50 iterations in 0.01 seconds)
Fitting performed in 0.16 seconds.
01-07-24 00:23:46 Completed in 3e-03 minutes (Real: 0.18; User: 0.17; System: 0.01) [d_TSNE]

mplot3_xy(iris.tSNE$projections.train[, 1], iris.tSNE$projections.train[, 2],
          # x has 149 rows after de-duplication, so drop the duplicate's label too
          group = iris$Species[!duplicated(iris[, 1:4])], main = "tSNE on iris",
          xlab = "1st tSNE component", ylab = "2nd tSNE component")
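
The perplexity argument loosely controls how many neighbors each point attends to, and it cannot be arbitrarily large relative to the sample size. The log above looks like output from the Rtsne package, which, to the best of my understanding, stops with an error when the number of cases minus one is smaller than three times the perplexity. Assuming that rule applies to d_TSNE() as well, a quick base-R check of the largest usable whole-number value could look like this:

```r
# Number of cases after removing the duplicate observation
n <- nrow(unique(iris[, 1:4]))

# Assumed Rtsne rule: perplexity is accepted only if n - 1 >= 3 * perplexity
max_perplexity <- floor((n - 1) / 3)

# The value used in the example above
perplexity <- 10
perplexity_ok <- perplexity <= max_perplexity
```

Note that t-SNE is stochastic, so repeated runs can give different embeddings unless the random seed is fixed beforehand, for example with set.seed().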