Partial Least Squares Regression - PLS
When the number of predictors is large, or even larger than the number of
observations, and the predictors are highly correlated, e.g., when using
spectroscopic data, Partial Least Squares (PLS) regression (Wold et al.,
2001) is frequently used (Brestich et al., 2018; Christler et al., 2021;
Felfödi et al., 2020; Rüdt et al., 2017; Walch et al., 2019). PLS is easy to apply and has the major advantage that model training is very fast. PLS transforms the original predictor variables into a set of latent variables, which are linear combinations of the original predictors. These linear combinations are determined such that the covariance between the scores (the values of the latent variables) and the response is maximized. The number of components (latent variables) is
an optimization parameter. It is usually determined within the framework
of cross-validation (CV). These latent variables are then used as
predictors in a multiple linear regression model for the CQA. If you want to optimize a PLS model, different subsets of predictors should be considered as inputs, as selecting the number of latent variables alone is usually not sufficient (Walch et al., 2019). Ultimately, PLS
models are a good starting point for modelling CQA when setting up
real-time monitoring or in process development based on spectroscopic
data. However, if you want to implement a real-time prediction model, more advanced methods should be considered.
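As an illustration, the sketch below (assuming scikit-learn and synthetic data in place of real spectra and CQA measurements) selects the number of latent variables by cross-validation:

```python
# Minimal sketch: choosing the number of PLS components by cross-validation.
# X would hold the spectra (n_samples x n_wavelengths), y the measured CQA.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))            # e.g., 60 spectra, 200 wavelengths
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=60)

# Evaluate 1..15 latent variables and keep the one with the lowest CV error
cv_rmse = []
for n_comp in range(1, 16):
    pls = PLSRegression(n_components=n_comp)
    scores = cross_val_score(pls, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse.append(-scores.mean())
best = int(np.argmin(cv_rmse)) + 1
print(f"selected {best} latent variables, CV-RMSE = {min(cv_rmse):.3f}")
```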
Structured Additive Regression - STAR
The relationships between the response and the different input variables
are usually non-linear. PLS, however, is a linear regression method.
Obviously, non-linearities are present in complex natural product
mixtures. These can be incorporated using Structured Additive Regression
(STAR) models (Fahrmeir et al., 2004), an extension of linear models.
Within this framework, the machine learning technique boosting is often
used for variable selection (Hofner et al., 2015; Hofner et al., 2011).
This method works very well when the number of predictors is moderate, as the optimization is computationally intensive (Sauer et al., 2019).
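To illustrate the idea of boosting-based variable selection, the following sketch implements component-wise L2 boosting with simple linear base learners; the cited work uses more flexible base learners (e.g., smooth terms), and the data, step size, and number of iterations here are illustrative assumptions:

```python
# Sketch of component-wise L2 boosting: in each iteration, only the single
# predictor that best fits the current residuals is updated, so predictors
# that are never selected drop out of the model (implicit variable selection).
import numpy as np

def componentwise_l2_boosting(X, y, n_iter=200, step=0.1):
    X = X - X.mean(axis=0)                          # centre predictors
    coef = np.zeros(X.shape[1])
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_iter):                         # n_iter is normally tuned by CV
        beta = X.T @ resid / (X ** 2).sum(axis=0)   # univariate least-squares fits
        sse = ((resid[:, None] - X * beta) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                     # best-fitting predictor
        coef[j] += step * beta[j]
        resid -= step * beta[j] * X[:, j]
    return intercept, coef                          # nonzero coef => selected

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.2, size=100)
b0, b = componentwise_l2_boosting(X, y)
print("selected predictors:", np.flatnonzero(b))
```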
Support Vector Machines Regression - SVM
A further method is Support Vector Machine (SVM) regression (e.g., Li et al., 2007). SVMs achieve an optimal model structure by a compromise, balancing the quality of the approximation of the given data against the complexity of the approximation function. SVMs are boundary methods; they do not try to model a group of objects. Because the boundary can be very complex, attention should be paid to the problem of overfitting. SVM was originally used for classification purposes, but its principles can be extended easily to the task of nonlinear regression by introducing an alternative loss function. The basic idea of SVM regression is to map the original data into a high-dimensional feature space via a nonlinear mapping, implicitly defined by a so-called kernel function, and then to perform a linear regression in this feature space. In this way, nonlinear regression in the original space becomes linear regression in the feature space. SVMs have
been included in recent studies on continuous biomanufacturing (Nikita
et al., 2022). The selection of an appropriate kernel function is data-dependent and requires expert knowledge. Additionally, hyperparameter tuning must be performed in order to avoid overfitting.
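A minimal sketch, assuming scikit-learn and illustrative data, of kernel SVM regression with the kernel hyperparameters tuned by cross-validation to limit overfitting:

```python
# Kernel SVM regression with hyperparameter tuning via grid search.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

pipe = make_pipeline(StandardScaler(), SVR())
grid = {
    "svr__kernel": ["rbf"],              # kernel choice is data-dependent
    "svr__C": [1, 10, 100],
    "svr__gamma": ["scale", 0.1, 1.0],
    "svr__epsilon": [0.01, 0.1],
}
search = GridSearchCV(pipe, grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```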
Gaussian Process Regression
Gaussian Process Regression (GPR) is a non-parametric, Bayesian machine
learning method that infers probability distributions rather than the
exact measured values (Rasmussen et al., 2021). These models are particularly attractive as they can be used for small data sets and not only provide flexible models but also give a model-based estimate of the prediction error. Recently, GPR has received increasing attention in the field of biotechnology. di Sciascio et al. (2008) used GPR for the development of a biomass concentration estimator, whereas Hutter et al. (2020) used GPR to efficiently learn from process data spanning multiple products.
However, in contrast to MLR, different software implementations will
give different error estimates due to different parameter definitions
and different numerical optimization (Erickson et al., 2018). A further limitation is that GPR does not scale well with increasing data size.
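A minimal sketch, assuming scikit-learn and illustrative data, showing how GPR returns both a prediction and a model-based uncertainty estimate:

```python
# GPR with an RBF + noise kernel; predict() returns mean and standard deviation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=30)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)   # predictive mean and std
for m, s in zip(mean, std):
    print(f"prediction {m:.2f} +/- {2 * s:.2f}")
```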
Tree-based methods
Alternative methods are tree-based models like regression trees or
Random Forests (Breiman, 2001). A regression tree is a hierarchical
model where observations are recursively split into binary partitions
based on their predictor values. Random Forests are ensemble learning
methods where the predictions are obtained by averaging over hundreds or
even thousands of trees built on bootstrap samples, i.e., samples taken
from the training data with replacement. Recently, it was shown that
these methods perform very well in downstream processing (Nikita et al.,
2022). Random Forests are very popular due to the built-in
permutation-based variable importance measure. This approach was used to
find suitable inputs for Artificial Neural Networks (Melcher et al.,
2015). Model tuning is very important, but setting up a prediction model is straightforward and fast, especially in comparison to ANNs.
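A minimal sketch, assuming scikit-learn and illustrative data, of a Random Forest regressor combined with permutation-based variable importance, which can then be used to pre-select inputs for other models:

```python
# Random Forest regression plus permutation-based variable importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_tr, y_tr)

# Permute each predictor in turn and measure the drop in predictive performance
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("most important predictors:", ranking[:3])
```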
Artificial Neural Networks - ANN
Artificial Neural Networks (e.g., James et al., 2021b; Lecun et al., 1989) are nonlinear statistical models that have been used in
bioprocess modelling since the early days (Glassey et al., 1997). They
can simulate highly nonlinear dynamic relationships of the process
without prior knowledge of the model structure. In a single-layer neural network (often called a shallow network; Lee et al., 2018), different linear combinations of the input variables are built and then a nonlinear function, e.g., the sigmoid function, is applied.
Since these new variables are not directly observed, they are called
hidden units, and often they are arranged in a graphical representation
as a hidden layer. The new variables can be used in a linear or
nonlinear regression model, resulting in an output variable. In classical ANNs, more hidden layers can be used, but with a strong tendency to overfit. Further disadvantages of classical ANNs are related to convergence speed, the choice of network topology, and bad local minima. After 2010, neural networks gained further attention, mainly in the field of image classification, under the name deep learning, where “deep” refers to the number of hidden layers. The architecture of the Convolutional Neural Network (CNN) was particularly successful (Lecun et al., 1989). Deep neural networks are now very successful in comparison to other machine learning techniques due to major improvements in computer hardware, the use of graphics processing units, and much larger data sources. Nowadays, modern neural
networks use the ReLU activation function instead of the sigmoid
function and consist of multiple hidden layers. Deep learning was
recently used for real-time quality prediction and process control
(Nikita et al., 2022). They applied deep neural networks and compared them to SVM, decision tree regression, and random forests on time series data from UV, conductivity, and pH probes of 84 batches. In this study, random forests and decision trees outperformed the deep neural networks, possibly due to the relatively small number of predictors. Rolinger et al. (2021) compared PLS to CNNs for their ability to
quantify the mAb concentration in the column effluent based on UV and
Raman spectroscopy. In this study, there was also no need to use deep learning, as a UV-based PLS model was already sufficiently precise. Deep learning was developed for large data sets, in terms of both the number of experiments and the number of input variables.
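A minimal sketch, assuming scikit-learn and illustrative data, contrasting a shallow network with sigmoid activation and a deeper network with ReLU activation:

```python
# Shallow (sigmoid) versus deeper (ReLU) feed-forward networks for regression.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 8))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)

shallow = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                       max_iter=5000, random_state=0)
deep = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu",
                    max_iter=5000, random_state=0)

for name, net in [("shallow/sigmoid", shallow), ("deep/ReLU", deep)]:
    rmse = -cross_val_score(net, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: CV-RMSE = {rmse:.3f}")
```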
Variable selection and feature importance
For classical ANNs, variable selection had to be performed prior to
setting up the model due to slow convergence. Random forests were
successfully used to derive an optimal set of inputs for the ANN (Melcher et al., 2015), and extensive variable selection was performed
prior to establishing the STAR and PLS models (Sauer et al., 2019; Walch
et al., 2019). For this purpose, prior knowledge such as amide bands and
fingerprint regions of the spectral data was used to select informative
predictors. Additionally, highly correlated wavelengths were removed by
reducing the resolution of the spectra. This required a considerable
amount of domain knowledge and engineering skills. In deep learning, these preparatory steps are no longer required, as the method is capable of dealing with a very large number of inputs, whose weights are computed automatically (Lecun et al., 1989). Of course, using a very large number of inputs for a prediction model comes at the expense of the interpretability of the model.
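The kind of pre-processing described above can be sketched as follows; the wavelength grid and the selected window are illustrative assumptions, not values from the cited studies:

```python
# Sketch of knowledge-driven input reduction for spectral data: the resolution
# is reduced by averaging neighbouring wavelengths (which also removes highly
# correlated inputs), and only a region assumed to be informative is kept.
import numpy as np

rng = np.random.default_rng(7)
wavelengths = np.linspace(1000, 1800, 800)      # hypothetical wavelength grid
spectra = rng.normal(size=(60, 800))            # 60 measured spectra

# 1) reduce resolution: average blocks of 8 neighbouring wavelengths
bin_size = 8
spectra_binned = spectra.reshape(60, -1, bin_size).mean(axis=2)
wl_binned = wavelengths.reshape(-1, bin_size).mean(axis=1)

# 2) keep only a window assumed to be informative (e.g., an amide-type band)
window = (wl_binned >= 1500) & (wl_binned <= 1700)
X_selected = spectra_binned[:, window]
print(spectra.shape, "->", X_selected.shape)
```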
Explainable Machine Learning
In the machine learning community, a great deal of research currently aims at explaining what happens inside prediction models
(Roscher et al., 2020; Zhong et al., 2022). Statistical, purely
data-based models are often called black box models in the field of
hybrid modelling (Simon et al., 2015a; Simon et al., 2015b). Of course, they are not black boxes. The impact of individual input parameters on the latent variables in PLS or the weights in ANNs can be extracted and interpreted. If the software, including the underlying code used to generate a certain model, is freely available, such models can be considered white-box models, as all information is available. Black-box models only occur if the code is not shared, as is usually the case with commercial software (Winter et al., 2021). However, especially if the machine
learning models are based on high-dimensional data, the interpretation
of the model parameters can be demanding. This process is more complex than understanding first-principles models with a clear mechanistic
relationship between the input and output variables. This makes the
efforts in the field of explainable machine learning even more important
and promising for the future of predictive chemometrics.
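As a simple illustration that such models are not literally black boxes, the following sketch (assuming scikit-learn and illustrative data) extracts and ranks the regression coefficients and the loadings of a fitted PLS model:

```python
# Inspecting a fitted PLS model: coefficients show how strongly each input
# wavelength contributes to the predicted CQA, loadings show how the inputs
# map onto the latent variables.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 200))                   # 60 spectra, 200 wavelengths
y = X[:, 10] - 0.5 * X[:, 50] + rng.normal(scale=0.1, size=60)

pls = PLSRegression(n_components=3).fit(X, y)
coefs = pls.coef_.ravel()                        # one coefficient per wavelength
most_influential = np.argsort(np.abs(coefs))[::-1][:5]
print("most influential inputs (indices):", most_influential)
print("loadings of the first latent variable:", pls.x_loadings_[:, 0][:5])
```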