Partial Least Squares Regression - PLS
When the number of predictors is large, possibly even larger than the number of observations, and the predictors are highly correlated, e.g., when using spectroscopic data, Partial Least Squares (PLS) regression (Wold et al., 2001) is frequently used (Brestich et al., 2018; Christler et al., 2021; Felfödi et al., 2020; Rüdt et al., 2017; Walch et al., 2019). PLS is easy to apply and has the big advantage that model training is very fast. PLS transforms the original predictor variables into a set of latent variables, which are linear combinations of the original predictors. These linear combinations are determined such that the covariance between the scores (the values of the latent variables) and the response is maximized. The number of components (latent variables) is an optimization parameter and is usually determined within the framework of cross-validation (CV). The latent variables are then used as predictors in a multiple linear regression model for the CQA. If you want to optimize a PLS model, different subsets of predictors should be considered as inputs, as selecting the number of latent variables alone is usually not sufficient (Walch et al., 2019). Ultimately, PLS models are a good starting point for modelling CQAs when setting up real-time monitoring or in process development based on spectroscopic data. However, if you want to implement a real-time prediction model, more advanced methods should be considered.
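As a minimal illustration, the sketch below fits a PLS model with scikit-learn and selects the number of latent variables by cross-validation; the data, dimensions and component grid are hypothetical and not taken from the cited studies.

```python
# Minimal sketch: PLS regression with CV-based selection of the number of
# latent variables (scikit-learn). Data dimensions and the component grid are
# illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

# X: (n_samples, n_wavelengths) spectra, y: (n_samples,) CQA values
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))          # e.g., 50 runs, 500 wavelengths
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=50)

# The number of latent variables is the key tuning parameter; choose it by CV.
search = GridSearchCV(
    PLSRegression(scale=True),
    param_grid={"n_components": list(range(1, 16))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print("optimal number of latent variables:", search.best_params_["n_components"])
```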
Structured Additive Regression - STAR
The relationships between the response and the different input variables are usually non-linear; PLS, however, is a linear regression method. Such non-linearities are clearly present in complex natural product mixtures. They can be incorporated using Structured Additive Regression (STAR) models (Fahrmeir et al., 2004), an extension of linear models in which the predictors may enter through smooth, non-linear functions. Within this framework, the machine learning technique boosting is often used for variable selection (Hofner et al., 2015; Hofner et al., 2011). This approach works very well when the number of predictors is moderate, as the optimization is computationally very intensive (Sauer et al., 2019).
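The boosting-based STAR models cited above are typically fitted with the R package mboost; as a rough, purely illustrative Python analogue, the sketch below approximates the additive structure with per-predictor spline expansions and uses an L1 penalty for variable selection. All data and settings are assumptions.

```python
# Rough Python analogue of a structured additive model: each predictor enters
# through its own spline basis, and an L1-penalised fit performs variable
# selection. This only approximates the boosting-based STAR approach cited in
# the text; it is not the same algorithm.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 8))          # moderate number of predictors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# SplineTransformer expands every column into a spline basis (additive structure);
# LassoCV shrinks the coefficients of uninformative basis functions to zero.
model = make_pipeline(
    SplineTransformer(n_knots=8, degree=3),
    StandardScaler(),
    LassoCV(cv=5),
)
model.fit(X, y)
print("non-zero basis coefficients:", int(np.sum(model[-1].coef_ != 0)))
```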
Support Vector Machines Regression - SVM
A further method is Support Vector Machine (SVM) regression (e.g., (Li et al., 2007)). SVMs achieve an optimal model structure through a compromise that balances the quality of the approximation of the given data against the complexity of the approximating function. SVMs are boundary methods; they do not try to model a group of objects. Because the boundary can be very complex, attention should be paid to the problem of overfitting. SVM was originally developed for classification purposes, but its principles extend easily to nonlinear regression through the introduction of an alternative (ε-insensitive) loss function. The basic idea of SVM regression is to map the original data into a high-dimensional feature space via a nonlinear mapping, defined implicitly through so-called kernel functions, and then to perform a linear regression in this feature space. A nonlinear regression problem in the original space thus becomes a linear regression problem in the feature space. SVMs have been included in recent studies on continuous biomanufacturing (Nikita et al., 2022). The selection of an appropriate kernel function is data dependent and requires expert knowledge, and hyperparameter tuning must be performed to avoid overfitting.
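A minimal sketch of kernel SVM regression with hyperparameter tuning is given below; the RBF kernel and the parameter grid are illustrative assumptions, since, as noted above, the appropriate kernel is data dependent.

```python
# Minimal sketch: kernel SVM regression with hyperparameter tuning
# (scikit-learn). Kernel choice, parameter grid and data are illustrative.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# C balances approximation quality against model complexity; epsilon defines
# the insensitive zone of the loss; gamma controls the width of the RBF kernel.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={
        "svr__C": [0.1, 1, 10, 100],
        "svr__epsilon": [0.01, 0.1, 0.5],
        "svr__gamma": ["scale", 0.01, 0.1],
    },
    cv=5,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
```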
Gaussian Process Regression
Gaussian Process Regression (GPR) is a non-parametric, Bayesian machine learning method that infers probability distributions over the predictions rather than single point estimates (Rasmussen et al., 2021). These models are particularly attractive as they can be used for small data sets and not only provide flexible models but also give a model-based estimate of the prediction error. Recently, GPR has received increasing attention in the field of biotechnology. di Sciascio et al. (2008) used GPR for the development of a biomass concentration estimator, whereas Hutter et al. (2020) used GPR to learn efficiently from process data spanning multiple products. However, in contrast to MLR, different software implementations will give different error estimates due to different parameter definitions and different numerical optimization (Erickson et al., 2018). A further limitation is that GPR models do not scale well with increasing data size.
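The sketch below illustrates, under assumed data and kernel choices, how a GPR model returns a model-based uncertainty estimate alongside each prediction.

```python
# Minimal sketch: Gaussian Process Regression with scikit-learn, returning a
# model-based uncertainty estimate alongside the prediction. Kernel choice and
# noise level are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(40, 1))            # small data set
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # prediction and its uncertainty
print(np.c_[mean, std])
```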
Tree-based methods
Alternative methods are tree-based models such as regression trees or Random Forests (Breiman, 2001). A regression tree is a hierarchical model in which the observations are recursively split into binary partitions based on their predictor values. Random Forests are ensemble learning methods in which the predictions are obtained by averaging over hundreds or even thousands of trees built on bootstrap samples, i.e., samples drawn from the training data with replacement. Recently, it was shown that these methods perform very well in downstream processing (Nikita et al., 2022). Random Forests are very popular due to their built-in permutation-based variable importance measure. This approach was used to find suitable inputs for Artificial Neural Networks (Melcher et al., 2015). Model tuning is very important, but setting up a prediction model is straightforward and fast, especially in comparison to ANNs.
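A minimal sketch of a Random Forest fit with variable importances is given below. Note that scikit-learn's built-in feature_importances_ measure is impurity-based, whereas the permutation-based measure mentioned above is available separately via sklearn.inspection.permutation_importance; data and settings are illustrative.

```python
# Minimal sketch: Random Forest regression with out-of-bag validation and
# variable importances (scikit-learn). Data and settings are illustrative;
# feature_importances_ here is impurity-based, not permutation-based.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 30))
y = 2 * X[:, 0] - X[:, 5] ** 2 + rng.normal(scale=0.2, size=200)

# Hundreds of trees built on bootstrap samples; predictions are averaged.
forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

print("out-of-bag R^2:", round(forest.oob_score_, 3))
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("most important predictors:", top)
```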
Artificial Neural Networks - ANN
Artificial Neural Networks (e.g., (James et al., 2021b; Lecun et al., 1989)) are nonlinear statistical models that have been used in bioprocess modelling since the early days (Glassey et al., 1997). They can model highly nonlinear dynamic relationships of the process without prior knowledge of the model structure. In a single-layer neural network (often called a shallow network, (Lee et al., 2018)), different linear combinations of the input variables are built and a nonlinear function, e.g., the sigmoid function, is then applied to each of them. Since these new variables are not directly observed, they are called hidden units, and they are often arranged in a graphical representation as a hidden layer. The new variables can be used in a linear or nonlinear regression model resulting in an output variable. In classical ANNs, more hidden layers can be used, but with a strong tendency to overfit. Further disadvantages of classical ANNs are related to convergence speed, network topology and bad local minima. After 2010, neural networks gained renewed attention, mainly in the field of image classification, under the name deep learning, where “deep” refers to the number of hidden layers. The architecture of the Convolutional Neural Network (CNN) was particularly successful (Lecun et al., 1989). These networks are now very successful in comparison to other machine learning techniques due to major improvements in computer hardware, the use of graphical processing units and much larger data sources. Modern neural networks typically use the ReLU activation function instead of the sigmoid function and consist of multiple hidden layers. Deep learning was recently used for real-time quality prediction and process control (Nikita et al., 2022). The authors applied deep neural networks and compared them to SVM, decision tree regression and random forests on time series data of UV, conductivity and pH probes from 84 batches. In this study, random forests and decision trees outperformed the deep neural networks, possibly due to the relatively small number of predictors. Rolinger et al. (2021) compared PLS to CNNs with respect to their ability to quantify the mAb concentration in the column effluent based on UV and Raman spectroscopy. In this study, there was also no need for deep learning, as a UV-based PLS model was already sufficiently precise. Deep learning was developed for large or big data, both in terms of the number of experiments and the number of input variables.
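As a minimal illustration, the sketch below fits a small feed-forward network with ReLU activation and two hidden layers; the architecture and hyperparameters are assumptions and do not reproduce the cited studies.

```python
# Minimal sketch: a feed-forward neural network with ReLU activation and two
# hidden layers (scikit-learn's MLPRegressor). Architecture, regularisation and
# data are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 50))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

net = make_pipeline(
    StandardScaler(),
    MLPRegressor(
        hidden_layer_sizes=(64, 32),   # two hidden layers
        activation="relu",             # modern choice instead of the sigmoid
        alpha=1e-3,                    # L2 regularisation against overfitting
        max_iter=2000,
        random_state=0,
    ),
)
net.fit(X, y)
print("training R^2:", round(net.score(X, y), 3))
```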
Variable selection and feature importance
For classical ANNs, variable selection had to be performed prior to setting up the model due to slow convergence. Random forests were successfully used to come up with an optimal set of inputs for the ANN (Melcher et al., 2015), and extensive variable selection was performed prior to establishing the STAR and PLS models (Sauer et al., 2019; Walch et al., 2019). For this purpose, prior knowledge such as the amide bands and fingerprint regions of the spectral data was used to select informative predictors. Additionally, highly correlated wavelengths were removed by reducing the resolution of the spectra. This required a considerable amount of domain knowledge and engineering skills. In deep learning, these preparatory steps are no longer required, as the method is capable of dealing with a very large number of inputs whose weights are computed automatically (Lecun et al., 1989). Of course, using a very large number of inputs for a prediction model comes at the expense of the interpretability of the model.
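The sketch below illustrates the general idea of importance-based input selection, ranking predictors with a Random Forest and passing only the top-ranked ones to a subsequent ANN; it does not reproduce the exact procedure of the cited studies, and all data and settings are hypothetical.

```python
# Illustrative sketch of importance-based input selection: rank predictors with
# a Random Forest and train the downstream model (here an MLP) on the
# highest-ranked inputs only. This mirrors the general idea described in the
# text, not the exact pipeline of the cited studies.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 200))                 # e.g., 200 wavelengths
y = X[:, 10] - 0.5 * X[:, 50] + rng.normal(scale=0.1, size=300)

# Step 1: rank all inputs with a Random Forest.
ranker = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
selected = np.argsort(ranker.feature_importances_)[::-1][:20]   # keep top 20

# Step 2: train the downstream model on the reduced input set.
ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
ann.fit(X[:, selected], y)
print("selected inputs:", np.sort(selected))
```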
Explainable Machine Learning
In the machine learning community, much research is currently devoted to explaining what happens inside prediction models (Roscher et al., 2020; Zhong et al., 2022). Statistical, purely data-based models are often called black-box models in the field of hybrid modelling (Simon et al., 2015a; Simon et al., 2015b). Of course, they are not black boxes. The impact of individual input parameters on the latent variables in PLS, or the weights in ANNs, can be extracted and interpreted. If the software, including the underlying code used to generate a certain model, is freely available, such models can be considered white-box models, as all information is accessible. Black-box models only arise if the code is not shared, as is usually the case with commercial software (Winter et al., 2021). However, especially if the machine learning models are based on high-dimensional data, the interpretation of the model parameters can be complex. This process is more demanding than understanding first-principles models with a clear mechanistic relationship between the input and output variables. This makes the efforts in the field of explainable machine learning all the more important and promising for the future of predictive chemometrics.
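As a simple illustration that such models are not black boxes, the sketch below extracts the regression coefficients and loadings of a fitted PLS model; the data are synthetic and purely illustrative.

```python
# Minimal sketch: extracting and inspecting the parameters of a fitted PLS
# model (scikit-learn). Synthetic, purely illustrative data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 100))                  # e.g., 100 wavelengths
y = 3 * X[:, 20] - 2 * X[:, 70] + rng.normal(scale=0.1, size=60)

pls = PLSRegression(n_components=3).fit(X, y)

# The regression coefficients map each original input directly to the response;
# the loadings show how each input contributes to the latent variables.
coef = pls.coef_.ravel()
print("inputs with the largest absolute coefficients:",
      np.argsort(np.abs(coef))[::-1][:5])
print("loadings of the first latent variable (first 5 inputs):",
      pls.x_loadings_[:5, 0])
```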