Figure 4: Categories of models based on the underlying data and assumptions, exemplified for the mass balance of a chromatographic separation. Mechanistic models are based on biophysical relationships and first-principles knowledge condensed into a mathematical model. These models are sometimes synonymously called “white box models”, as they have to be based on full process understanding. All parameters are determined by independent experiments. For mechanistic models with fit factors, individual parameters are determined experimentally; to that extent they are data driven. Data-driven models are mathematical models based solely on statistical relationships in the data, mostly derived from online sensors and offline analysis. They are not based on biophysical relations; therefore, they are also called “black box models”. However, these models depict the entirety of the process space. Hybrid models combine mechanistic and data-driven models and can compensate for the former’s lack of process understanding with data collected throughout the process. They are of special interest for bioprocess modelling, as in most cases those processes are not fully understood and describable.
Multiple linear regression (MLR) is one of the most frequently used methods in multivariate data analysis. It can easily be applied, and interpretation of the effect of the individual predictors on the response is straightforward. However, this is only true if there are many more observations than predictors in the model and there is no correlation structure among the predictors. In such situations linear regression can be used to model a clearly understood mechanistic relationship between variables. On the other hand, if the predictors are highly correlated, interpretation of the individual effects of those predictors is no longer possible, as individual effects can be masked by several correlated variables. Therefore, it is mandatory to check the correlation structure between the input variables prior to setting up an MLR model. In the case of exact multicollinearity, no solution can be calculated at all, as the covariance matrix becomes singular (Varmuza et al., 2009). In many practical situations MLR is still used even if the predictors included in the model are moderately correlated. In such situations the model is selected because it fits the response well rather than for interpretability of the individual effects of the input variables. This is a very pragmatic and purely data-driven approach where model selection is based solely on the performance measure. Even spurious correlations are exploited; there is no need for a causal relationship, because we make use of the data structure: e.g., we control and measure the pumps, pH, conductivity and pressure in a chromatography run and search for a relationship between these multiple sensor signals and the response. Finally, we can predict CQAs.
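As a minimal sketch of this workflow (the sensor matrix, variable names and data here are simulated and purely illustrative), the correlation structure of the predictors can be inspected in Python before an MLR model is set up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical sensor matrix: 60 runs x 3 predictors (e.g., pH,
# conductivity, pressure); the data are simulated for illustration.
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=60)

# Inspect the correlation structure among the predictors before fitting
# MLR: large off-diagonal correlations (or a very large condition
# number) signal collinearity, i.e., the individual coefficients can
# no longer be interpreted reliably.
print(np.corrcoef(X, rowvar=False))
print("condition number:", np.linalg.cond(X))

mlr = LinearRegression().fit(X, y)
print("coefficients:", mlr.coef_)
```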
Model Selection, Training and Overfitting
For model selection, the performance of the different models, each built on so-called training data, must be evaluated with a suitable performance measure. Typical performance measures are the mean squared error (MSE), the root mean squared error (RMSE) and the mean relative deviation (MRD).
The MSE measures the squared differences between the observed (measured) values and the values predicted by the model and takes the mean over all these differences. The MSE is often used during model building and parameter optimization. However, it is not suitable for evaluating the prediction error itself, as it is a squared quantity. To express the size of the error on the original scale, the square root of the MSE is used, the RMSE, which is in the same unit as the CQA that was modelled. The MRD is the mean absolute difference between the observed and the predicted value, standardized by the observed value. It is often given as a percentage.
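Under these definitions, the three measures can be computed as in the following sketch (the observed and predicted values are made-up numbers for illustration):

```python
import numpy as np

def mse(y_obs, y_pred):
    # Mean of the squared differences between observed and predicted values.
    return np.mean((y_obs - y_pred) ** 2)

def rmse(y_obs, y_pred):
    # Square root of the MSE; same unit as the modelled CQA.
    return np.sqrt(mse(y_obs, y_pred))

def mrd(y_obs, y_pred):
    # Mean absolute deviation standardized by the observed value, in percent.
    return 100.0 * np.mean(np.abs(y_obs - y_pred) / np.abs(y_obs))

y_obs = np.array([10.2, 9.8, 11.1, 10.5])    # measured CQA values (made up)
y_pred = np.array([10.0, 10.1, 10.8, 10.6])  # model predictions (made up)
print(f"MSE={mse(y_obs, y_pred):.3f}, RMSE={rmse(y_obs, y_pred):.3f}, "
      f"MRD={mrd(y_obs, y_pred):.1f}%")
```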
It is not valid to calculate a performance measure on the data the model
was built on. Instead, an independent test set needs to be used. A split
of the available data into a training set (ca. 50% of the objects), a
validation set (ca. 25% of the objects) and a test set (ca. 25% of the
objects) is often recommended (James et al., 2021a; Varmuza et al.,
2009).
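A 50/25/25 split along these lines could, for example, be realized as follows (the data are simulated and the proportions follow the recommendation above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # simulated predictors
y = X @ np.array([1.0, 0.5, 0.0, -0.3, 0.2])  # simulated response

# First split off 50% for training, then divide the remainder evenly
# into a validation set (25%) and a test set (25%).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)
print(len(X_train), len(X_val), len(X_test))  # 50, 25, 25
```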
Here, the training data are used for model building and the validation data for parameter optimization. This split of the data is very useful to avoid overfitting, i.e., the model performing very well on the training data but not generalizing to new data it has not seen during model building. The performance of different models can then be evaluated on the independent test set in order to check how well they generalize. If only a limited number of observations is available for model building, the split must be adapted. Depending on the statistical method used, an individual validation set might not be necessary, as parameter optimization is done on the training data via cross-validation, e.g., when using PLS. In this case 75% or even 80% of the data can be used for model training (Westad et al., 2015).
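A sketch of this approach for PLS, assuming simulated data and an illustrative grid of latent variables, could look like this:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=80)

# Use 80% for training; the held-out 20% acts as the independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=2)

# The number of latent variables is optimized by 5-fold cross-validation
# on the training data only, so no separate validation set is needed.
search = GridSearchCV(
    PLSRegression(),
    param_grid={"n_components": range(1, 8)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_, "CV RMSE:", -search.best_score_)
```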
Experimental Design vs Design of Experiments – how much experimental data do we need?
A model will only perform well on the test set if the study was designed properly. Still, as we are working with biological material, the choice of split into training, validation and test sets has a major impact on the performance measure. Therefore, multiple splits and/or cross-validation as an efficient method to reuse the data are recommended (James et al., 2013, 2021a); a sketch of this follows at the end of this section. When designing a study, it is important to keep the goal of the study in mind. The prediction model could be used in a biopharmaceutical production process where the process conditions are fixed and only small deviations are expected. Alternatively, real-time monitoring could also be used during process development, where many different process conditions need to be explored. Felfödi et al. (2020) showed how the precision of the analytical method also impacts the amount of data needed to set up a prediction model. For model training, the number of fractions for the off-line analysis together with the number of chromatographic runs performed is crucial. The prediction model can only be of high quality if the off-line analysis is sufficiently precise (expressed as the coefficient of variation). Therefore, a compromise must be found between the number of fractions analyzed, the number of chromatographic runs and the performance of the prediction model expressed as RMSE (Figure 5).
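One possible way to assess the influence of the split, assuming again simulated data and an illustrative PLS model, is repeated cross-validation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=60)

# Repeat 5-fold cross-validation 10 times with different random splits
# and report the mean and spread of the RMSE instead of trusting a
# single train/test division.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=3)
scores = -cross_val_score(PLSRegression(n_components=2), X, y, cv=cv,
                          scoring="neg_root_mean_squared_error")
print(f"RMSE: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```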