Figure 4: Categories of models based on the underlying data and assumptions, exemplified for the mass balance of a chromatographic separation. Mechanistic models are based on biophysical relationships and first-principles knowledge packed into a mathematical model. These models are sometimes synonymously called “white box models” as they have to be based on full process understanding. All parameters are determined by independent experiments. For mechanistic models with fudge factors, individual parameters are determined experimentally, i.e., they are partly data driven. Data-driven models are mathematical models solely based on statistical relationships in data, mostly derived from online sensors and offline analysis. They are not based on biophysical relations. Therefore, they are also called “black box models”. However, these models depict the entirety of the process space. Hybrid models combine mechanistic and data-driven models and can compensate for the mechanistic model's lack of process understanding with data collected throughout the process. They are of special interest for bioprocess modelling as in most cases those processes are not fully understood and describable.
Multiple linear regression (MLR) is one of the most frequently used methods in multivariate data analysis. It can easily be applied, and the interpretation of the effect of the individual predictors on the response is straightforward. However, this is only true if there are many more observations than predictors in the model and there is no correlation structure among the predictors. In such situations linear regression can be used to model a clearly understood mechanistic relationship between variables. On the other hand, if the predictors are highly correlated, interpretation of the individual effects of those predictors is no longer possible; individual effects can be masked by several correlated variables. Therefore, it is mandatory to check the correlation structure between the input variables prior to setting up an MLR model. In the case of multicollinearity, no solution can be calculated at all as the covariance matrix becomes singular (Varmuza et al., 2009). In many practical situations MLR is still used even if the predictors included in the model are moderately correlated. In such situations the model is selected because it fits the response well rather than for interpretability of the individual effects of the input variables. This is a very pragmatic and purely data-driven approach where model selection is based solely on the performance measure. Spurious correlations may be exploited, and no causal relationship is required, because the data structure itself is used: e.g., the pumps, pH, conductivity and pressure are controlled and measured in a chromatography run, and a relationship across multiple sensors is searched for. Finally, CQAs can be predicted.
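As an illustration, the following minimal Python sketch (assuming NumPy, pandas and scikit-learn; all variable names and data are hypothetical) shows how the correlation structure of the predictors can be inspected before an MLR model is set up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical sensor data from chromatography runs: one row per
# observation, one column per predictor.
X = pd.DataFrame({
    "pump_flow":    rng.normal(1.0, 0.05, 30),
    "pH":           rng.normal(7.2, 0.10, 30),
    "conductivity": rng.normal(15.0, 1.0, 30),
    "pressure":     rng.normal(2.5, 0.20, 30),
})
# Hypothetical response, e.g., a CQA determined by off-line analysis.
y = 0.8 * X["pump_flow"] - 0.1 * X["conductivity"] + rng.normal(0, 0.02, 30)

# Pairwise correlations of the predictors; values close to +/-1
# indicate multicollinearity that masks individual effects.
print(X.corr().round(2))

# Condition number of the standardized predictor matrix: very large
# values signal (near-)singularity, i.e., unstable MLR coefficients.
X_std = (X - X.mean()) / X.std()
print("condition number:", np.linalg.cond(X_std.to_numpy()))

mlr = LinearRegression().fit(X, y)
print(dict(zip(X.columns, mlr.coef_.round(3))))
```

A condition number far above roughly 30 (a common rule of thumb) would indicate that the fitted coefficients should not be interpreted individually.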
Model Selection, Training and Overfitting
For model selection, the performance of different models on so-called training data must be evaluated. Typical performance measures are the mean squared error (MSE), the root mean squared error (RMSE) or the mean relative deviation (MRD).
The MSE measures the squared differences between the observed (measured) values and the ones predicted by the model and takes the mean over all these squared differences. The MSE is often used during model building and parameter optimization. However, it is not suitable for evaluating the size of the prediction error as it is a squared quantity. To get an idea of the size of the error, the root of the MSE, the RMSE, is used, which is in the same unit as the CQA that was modelled. The MRD is the mean absolute difference between the observed and the predicted value, standardized by the observed value. It is often given in percent.
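These definitions translate directly into code; the following minimal sketch (assuming NumPy, with hypothetical observed and predicted CQA values) computes all three measures:

```python
import numpy as np

def mse(y_obs, y_pred):
    """Mean squared error: mean of the squared differences."""
    return np.mean((y_obs - y_pred) ** 2)

def rmse(y_obs, y_pred):
    """Root mean squared error: same unit as the modelled CQA."""
    return np.sqrt(mse(y_obs, y_pred))

def mrd(y_obs, y_pred):
    """Mean relative deviation: mean absolute difference standardized
    by the observed value, reported here in percent."""
    return np.mean(np.abs(y_obs - y_pred) / np.abs(y_obs)) * 100

y_obs  = np.array([10.2, 9.8, 11.1, 10.5])   # measured CQA (hypothetical)
y_pred = np.array([10.0, 10.1, 10.8, 10.4])  # model predictions (hypothetical)
print(rmse(y_obs, y_pred), mrd(y_obs, y_pred))
```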
It is not valid to calculate a performance measure on the data the model was built on. Instead, an independent test set needs to be used. A split of the available data into a training set (ca. 50% of the objects), a validation set (ca. 25% of the objects) and a test set (ca. 25% of the objects) is often recommended (James et al., 2021a; Varmuza et al., 2009).
Here the training data is used for model building and the validation data is used for parameter optimization. This split of the data is very useful to avoid overfitting, i.e., the model performing very well on the training data but not generalizing to new data it has not seen during model building. The performance of different models can then be evaluated on the independent test set in order to check how well they generalize. If only a limited number of observations is available for model building, the split must be adapted. Depending on the statistical method used, a separate validation set might not be necessary as parameter optimization is done on the training data via cross-validation, e.g., when using PLS. In this case 75% or even 80% of the data can be used for model training (Westad et al., 2015).
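This workflow can be sketched in Python (assuming scikit-learn; the data, the 25% test fraction and the range of PLS components are hypothetical choices): the number of PLS components is optimized by cross-validation on the training data, so no separate validation set is needed, and the performance is reported on the held-out test set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                      # hypothetical sensor signals
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.1, 100)  # hypothetical CQA

# 75% training / 25% independent test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Optimize the number of latent variables by 5-fold cross-validation
# on the training data (RMSE as performance measure).
cv_rmse = []
for n_comp in range(1, 11):
    scores = cross_val_score(PLSRegression(n_components=n_comp),
                             X_train, y_train, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse.append(-scores.mean())
best = int(np.argmin(cv_rmse)) + 1

# Final evaluation on the independent test set only.
pls = PLSRegression(n_components=best).fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - pls.predict(X_test).ravel()) ** 2))
print(best, round(test_rmse, 3))
```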
Experimental Design vs Design of Experiments – how much experimental data do we need?
A model will only perform well on the test set if the study was designed properly. Still, as we are working with biological material, the choice of how the data is split into training, validation and test sets has a major impact on the performance measure. Therefore, multiple splits and/or cross-validation as an efficient method to reuse the data are recommended (Gareth et al., 2021; James et al., 2013, 2021a). When designing a study, it is important to keep the goal of the study in mind. The prediction model could be used in a biopharmaceutical production process where the process conditions are fixed and only small deviations are expected. Alternatively, real-time monitoring could also be used during process development, where many different process conditions need to be explored. Felfödi et al. (2020) showed how the precision of the analytical method also impacts the amount of data needed to set up a prediction model. For model training, the number of fractions for the off-line analysis together with the number of chromatographic runs performed is crucial. The prediction model can only be of high quality if the off-line analysis is very accurate (expressed as coefficient of variation). Therefore, a compromise must be found between the number of fractions analyzed, the number of chromatographic runs and the performance of the prediction model expressed as RMSE (Figure 5).
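The dependence of the performance measure on the particular split can be made visible by repeating the split; the following minimal sketch (again assuming scikit-learn, with hypothetical data and settings) reports the spread of the cross-validated RMSE over repeated 4-fold splits:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 15))                          # hypothetical process data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 60)   # hypothetical CQA

# Repeat the 4-fold split several times so the reported RMSE does not
# hinge on one particular split of the biological material.
cv = RepeatedKFold(n_splits=4, n_repeats=10, random_state=1)
scores = cross_val_score(PLSRegression(n_components=3), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
print("RMSE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))
```

A large standard deviation across the repeats indicates that a single train-test split would give an unreliable estimate of the model performance.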