Statistical Analysis and Empirical Modeling Methods
Experimental conditions were performed in biological replicates, and the
resulting data are presented with the standard error of the mean. CVC and
mAb titer were normalized to the maximum CVC and titer of the respective
process control conditions. Graphical analysis and standard error
calculations were performed in Microsoft Excel, while ANOVA and one-way
Student's t-tests were performed in SAS JMP (SAS, USA).
SIMCA-P+ (Sartorius Stedim, Germany) was used for MVDA modeling;
detailed methods for PCA and PLS regression are described by Wold et al.
(MKS Umetrics AB, 2013). In brief, multivariate methods begin with
dimensional reduction, whereby a high-dimensional dataset is reduced to
a lower-dimensional space and explained by fewer variables (i.e.,
latent variables). The latent variables are calculated by unit-variance
scaling of the data, projecting the scaled data onto a lower-dimensional
space, and finally finding the directions of greatest variance within
that space, i.e., the eigenvectors (Mevik &
Wehrens, 2007). The total variable contribution towards the greatest
variance in any given direction is described by the first latent
variable or, in the case of PCA, the first principal component. Every
subsequent component within a PCA model is orthogonal (perpendicular)
to the preceding component and explains the greatest share of the
remaining variance in the dataset. Accordingly, each component is a
summation of the individual variable contributions, or loadings, towards
that variance.
Interestingly, Wold et al. found that the summed contribution of each
variable across all the components in a model can be expressed as a
variable importance in projection (VIP) score, which provides a
heuristic multivariate ranking of the variables (Akarachantachote,
Chadcham, & Saithanu, 2013; Prieto et al., 2014). As a result,
dimensional reduction models can be used not only to provide a holistic
view of the distribution of batches or even individual observations, but
also to identify the key variable contributions that explain the
separation between observations and, ultimately, serve as potential
targets for optimization efforts.
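The dimensional reduction described above can also be sketched outside of SIMCA. The following NumPy example (with synthetic data standing in for a scaled training set; the matrix sizes are illustrative only) computes PCA scores, loadings, and per-component explained variance via singular value decomposition:

```python
import numpy as np

# Synthetic stand-in for a cell culture dataset: 25 batches x 6 variables
# with built-in low-rank structure (sizes are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3)) @ rng.normal(size=(3, 6)) \
    + 0.1 * rng.normal(size=(25, 6))

# Unit-variance scaling: mean-center each column and divide by its std dev
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the scaled matrix yields the principal components directly
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U * S                     # projections of batches onto components
loadings = Vt.T                    # per-variable contribution to each component
explained = S**2 / np.sum(S**2)    # fraction of total variance per component
```

Because the singular values are returned in descending order, the leading components capture the low-rank structure, mirroring how the first principal component describes the direction of greatest variance.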
In contrast, the calculation of latent variables is modified in
supervised approaches such as PLS, where the latent variables instead
follow the direction of the greatest covariance between the explanatory
variables and the response variable(s). Since the first predictive
component in a PLS model is not the first principal component, PLS
models become very powerful when there is a high degree of collinearity
between the explanatory variables and the response variable. However, when the
variables are nonlinear in behavior, such as in the case of amino acid
stoichiometric balances, the goodness of fit (R2) and
the goodness of prediction (Q2) of a PLS model are
significantly impacted as a high degree of information is unaccounted
for in the explanatory variable space. To circumvent this issue,
Orthogonal Partial Least Squares (OPLS) was used, in which the
predictive component captures the covariance between the explanatory and
response variables while each subsequent orthogonal component captures
systematic variation in the explanatory variables that is uncorrelated
with the response. Consequently, OPLS models yield better predictions
and greater interpretability for nonlinear response variables, as a
greater degree of the information in the explanatory variable space is
accounted for (Bylesjö et al., 2006; Yamamoto et al., 2009). Accordingly,
OPLS models built with stoichiometric balances were found to have higher
Q2 values than PLS models (data not presented).
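The supervised projection underlying PLS can be illustrated with a minimal single-response (PLS1) NIPALS implementation. This is a sketch with synthetic collinear data, not the commercial SIMCA OPLS algorithm used in this work; the function name and dataset are illustrative assumptions:

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Minimal PLS1 (single response) via NIPALS; returns regression
    coefficients B mapping mean-centered X to mean-centered y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)        # weight: max covariance direction with y
        t = Xc @ w                    # scores
        p = Xc.T @ t / (t @ t)        # X loadings
        q = (yc @ t) / (t @ t)        # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - t * q               # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)

# Synthetic data: second half of the columns nearly duplicates the first half,
# so the explanatory variables are highly collinear with each other and with y
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 10))
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(25, 5))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=25)

B = pls1_nipals(X, y, n_components=5)
y_hat = (X - X.mean(axis=0)) @ B + y.mean()
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Despite the near-singular covariance structure that would destabilize ordinary least squares, the latent-variable projection fits the collinear data with a high R2, which is the property exploited in the models above.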
Lastly, to incorporate the time-dependent contributions of the amino
acid consumption rates and resulting stoichiometric balances, the
training set data matrix was transformed into a batch-level model (BLM)
format, in which every variable at every measured day becomes an
independent variable. As a result, each row of a BLM data matrix
represents a single batch, with each cell culture variable expanded into
one additional column per timepoint at which it was measured, forming a
wider and shorter data matrix. The
benefits of the BLM format over the untransformed format were
two-fold: (1) each variable in the model provided a specific
time-dependent contribution representative of a particular instant in a
batch, and (2) the OPLS algorithm gained precision, since each model
component is a weighted average of all the variables and the increased
number of variables improved the reliability of that weighted average
(Vajargah, Sadeghi-Bazargani, Mehdizadeh-Esfanjani, Savadi-Oskouei, &
Farhoudi, 2012; Worley & Powers, 2013; Worley & Powers, 2015). For
instance, all
20 amino acid stoichiometric balances that were captured from the 25
training batches were measured every other day from day 0 to day 14 and
interpolated for the unmeasured days, resulting in 15 observations per
batch, or 375 observations across all 25 batches. When transformed
into a BLM format, the 20 original stoichiometric balance variables were
expanded to 300 variables capturing the stoichiometric balance at each
time point (day 0 – day 14) and the observations collapsed to 25 rows
representing one observation per batch. The complexity of the resulting
matrix was then drastically reduced through dimensional reduction into
latent variables, highlighting the ability of MVDA modeling to retain
time-dependent biological information for a large set of variables and
to rapidly identify key modulators for improving bioprocess development
efforts.
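The batch-level unfolding in the example above can be sketched in a few lines of NumPy. The dimensions mirror those in the text (25 batches, 15 interpolated days, 20 stoichiometric balances); the random values and column labels are placeholders:

```python
import numpy as np

n_batches, n_days, n_vars = 25, 15, 20   # mirrors the training set in the text

# Long format after interpolation: one row per (batch, day) pair -> 375 x 20,
# ordered batch-major (batch 0 days 0-14, then batch 1, and so on)
rng = np.random.default_rng(2)
long_matrix = rng.normal(size=(n_batches * n_days, n_vars))

# BLM unfolding: one row per batch, one column per (day, variable) -> 25 x 300
blm = long_matrix.reshape(n_batches, n_days * n_vars)

# Explicit column labels make the time expansion visible, e.g. "var3_day7"
labels = [f"var{v}_day{d}" for d in range(n_days) for v in range(n_vars)]
```

The row-major reshape preserves the day ordering within each batch, so the first 20 columns of a BLM row are that batch's day-0 balances, the next 20 its day-1 balances, and so on, which is what gives each unfolded variable its time-specific loading in the OPLS model.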