Short Communication

Chinese Hamster Ovary (CHO) cells are widely used to manufacture complex biotherapeutic molecules at large scales. Industrial bioprocesses ensure high product yield and quality by maintaining favorable growth conditions in cell culture environments, which requires careful monitoring and control of nutrient availability. Chemically-defined serum-free media can contain dozens to more than 100 components (Ritacco et al., 2018), but key nutrients include proteinogenic amino acids, which are direct substrates and regulators (Duarte et al., 2014; Fomina‐Yadlin et al., 2014) of proliferation and protein synthesis. Unfortunately, conventional methods for amino acid quantification based on liquid chromatography and mass spectrometry are time-consuming and difficult to use for decision making and control of cell culture. Alternative spectroscopic approaches have been sensitive to only a limited number of amino acid species (Bhatia et al., 2018). Here we present a computational method to forecast time-course amino acid concentrations from routine bioprocess measurements, facilitating timely and anticipatory control of the bioprocess (Figure 1).
At the foundation of our method is a genome-scale metabolic network model, which accounts for the complex conversion from media nutrients to biomass and recombinant protein production. Such models have been increasingly utilized for CHO cells (Hefzi et al., 2016; Calmels et al., 2019; Huang & Yoon, 2020) and bioprocess applications (Sommeregger et al., 2017; Zhang & Hua, 2016), such as predicting clonal performances (Popp et al., 2016), identifying metabolic bottlenecks (Zhuangrong & Seongkyu, 2020), and optimizing media formulation (Fouladiha et al., 2020; Traustason et al., 2019). Metabolic network models can also estimate the amino acid uptake rates necessary to support experimentally observed proliferation and productivity (Chen et al., 2019). However, several challenges have limited their practical application.
First, metabolic network models are typically highly complex but under-constrained, and are therefore easy to overfit. This is mitigated by training the model on a variety of bioprocess conditions and metabolic phenotypes. Second, metabolic network models assume that cells operate at some metabolic optimum, and thus tend to describe an idealized metabolism specifically fit to the assumed objective (e.g., biomass production (Feist & Palsson, 2010; Szeliova et al., 2020) or minimization of redox (Savinell & Palsson, 1992)). Third, for the present purpose, these models need to predict amino acid consumption fluxes, typically on the order of 10⁻³ mmol·gDW⁻¹·hr⁻¹ (see Methods), from input data that are orders of magnitude larger, such as growth rate and glucose consumption (10⁻¹ to 10⁻² mmol·gDW⁻¹·hr⁻¹). The preceding two challenges increase prediction error. Lastly, metabolic network models assume a steady state, which reduces the range of forecast. Typically, input data from one day are used to make predictions for the same day. However, such predictions cannot be extended to multiple days or subsequent culture phases, as cross-temporal shifts in metabolism would violate the steady state assumption. Thus, model predictions of amino acid concentrations can be overfit, idealized, and near-sighted, all of which dilute their practicality for industrial bioprocess control. Here we demonstrate that these weaknesses can be addressed in a data-driven manner by coupling a metabolic network model with machine learning.
We developed this hybrid approach on a diverse set of 10 CHO clones with different growth and productivity profiles from two different fed-batch production processes. These CHO clones were subjected to different bioprocess conditions and recombinant antibody identities (see Methods), resulting in a variety of phenotypes and productivity performances (Fig. S1). For example, several high-performing clones were exceptionally proliferative or productive, suggesting an efficient conversion from nutrients to biomass or recombinant protein product. Other clones performed these conversions at lower rates, suggesting attenuated metabolic activity or inefficient resource utilization. The CHO cells adjusted their nutrient uptake according to these various metabolic phenotypes, leading to diverse amino acid consumption patterns (Fig. S2). For example, the consumption of glucose and serine differed by several fold across conditions and time. Furthermore, different clones varied in their consumption or secretion of key metabolites such as lactate, alanine, glycine, and glutamine.
We sought to predict these diverse consumption behaviors using a tailored model of CHO metabolism (Schinn et al., 2020). As input information, we utilized the following routinely measured industrial bioprocess data: (1) viable cell density and titer measurements, from which growth rate and specific productivity are calculated (Methods, equation 1), and (2) bioreactor concentrations of glucose, lactate, glutamate and glutamine, from which their respective consumption rates are calculated. These measurements were used as boundary conditions by constraining the fluxes of biomass production, recombinant protein synthesis and consumption of the four metabolites to observed values. Subsequently, we used Markov chain Monte Carlo sampling of metabolic fluxes (Schellenberger et al., 2011) to characterize the range and magnitude of all reaction fluxes and thereby calculate the likely uptake fluxes of the remaining 18 proteinogenic amino acids (see Methods). These predictions were applied to the CHO clones across 8 days of a 12-day production run (days 4 to 11), resulting in a total of 80 individual predictions.
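The conversion from routine measurements to model inputs can be illustrated with a minimal Python sketch. Equation 1 of the Methods is not reproduced in this section, so the formulas below are the commonly used interval-based definitions and are an assumption, not necessarily the authors' exact form. The function names, the trapezoid approximation of the integral viable cell density, and the numerical values are all hypothetical.

```python
import math

def growth_rate(x1, x2, dt):
    """Exponential growth rate (per unit time) between two viable cell density readings."""
    return math.log(x2 / x1) / dt

def ivcd(x1, x2, dt):
    """Integral of viable cell density over the interval (trapezoid rule, an assumption)."""
    return 0.5 * (x1 + x2) * dt

def specific_rate(c1, c2, x1, x2, dt):
    """Per-cell rate of change of a concentration; negative values mean net consumption."""
    return (c2 - c1) / ivcd(x1, x2, dt)

# Hypothetical readings one day apart: density in 1e6 cells/mL, titer in mg/L, glucose in mM.
mu = growth_rate(10.0, 12.0, 1.0)
q_product = specific_rate(1000.0, 1200.0, 10.0, 12.0, 1.0)   # mg/L per (1e6 cells/mL * day)
q_glucose = specific_rate(30.0, 25.0, 10.0, 12.0, 1.0)       # negative: glucose is consumed
```

With these units, a titer change in mg/L divided by a cell density in 1e6 cells/mL corresponds to pg per cell, so `q_product` reads as pg per cell per day. Rates computed this way would then be converted to the model's flux units before being imposed as boundary conditions.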
We evaluated the resulting model predictions in two ways. First, we examined the differences between model predictions and experimental measurements of amino acid uptake and secretion (Figure 2A). For most amino acids, this difference was small compared to the scale of input data, suggesting that metabolic models can describe the conversion from nutrients to biomass and recombinant proteins. Second, we examined the fold changes between model predictions and experimental measurements. These fold change errors are summarized in Figure 2B by their mean and variance across the 80 observations. Fold change error varied considerably across amino acids. For example, the model predicted some essential amino acids consistently well – e.g. phenylalanine, cysteine and tryptophan (fold change ≈ 1), but predicted others poorly – e.g. alanine, lysine, glycine, and methionine (fold change ≈ 0). Taken together, the sizeable fold change errors for many amino acids confirm the difficulty of using metabolic network models alone to predict amino acid consumption.
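The fold change summary described above can be sketched in a few lines of Python. The ratio is taken as prediction over measurement, which matches the text's reading that a value below 1 indicates underestimation. The use of the sample variance, the function name, and the toy numbers are assumptions made for illustration.

```python
from statistics import mean, variance

def fold_change_summary(predicted, measured):
    """Return (mean, sample variance) of predicted/measured ratios for one amino acid."""
    # Observations with a measured rate of exactly zero are skipped to avoid division by zero.
    ratios = [p / m for p, m in zip(predicted, measured) if m != 0]
    return mean(ratios), variance(ratios)

# Hypothetical uptake rates for a single amino acid across three observations.
fc_mean, fc_var = fold_change_summary([0.8, 1.0, 1.2], [1.0, 1.0, 1.0])
```

A mean near 1 with a small variance would indicate accurate and consistent predictions, whereas a mean well below 1 with a small variance would indicate a consistent underestimate.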
Notably, the model systematically underestimated consumption rates for almost all amino acids (fold change < 1). This is likely because the model does not consider certain metabolic inefficiencies – e.g. CHO cells consume more amino acids than needed for the observed production of biomass and recombinant protein, and catabolize the excess into byproducts (Mulukutla et al., 2017). Furthermore, the variance of fold change error was relatively low (≤1) for most amino acids. This suggests that the difference between model ideality and biological reality remained consistent across many clones and conditions.
We hypothesized that this consistent gap could be bridged with data and statistical modeling. We constructed a series of linear regression models to ‘correct’ the predictions from metabolic modeling, using growth rate and the predictions from the metabolic model as explanatory variables (Methods, equation 2). The 80 observations were randomly divided into a training dataset and validation dataset, consisting of 48 and 32 observations, respectively. The regression coefficients were first estimated from the training dataset and then applied to the validation dataset. According to validation results, the regression models substantially improved predictions, as fold change error approached unity for most amino acids (Fig. 3B). As exceptions, predictions for alanine, glycine and histidine were not reliably improved (Fig. 3, red). These results were replicated in additional validation studies involving four distinct clones (Supplementary Document).
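The correction step can be illustrated with a small ordinary least squares sketch. Equation 2 of the Methods is not reproduced in this section, so the additive form with an intercept, the function names, and the synthetic noiseless data are all assumptions. Only the two explanatory variables (growth rate and the metabolic model prediction) and the 48/32 random split come from the text.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations; each row is [1, growth, model_flux]."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, rows):
    """Apply fitted coefficients to new rows."""
    return [sum(c * v for c, v in zip(coef, r)) for r in rows]

# Synthetic stand-in for the 80 observations of one amino acid (hypothetical values).
rng = random.Random(0)
growth = [rng.uniform(0.1, 0.6) for _ in range(80)]
model_flux = [rng.uniform(0.5, 2.0) for _ in range(80)]
measured = [0.2 + 0.5 * g + 1.8 * f for g, f in zip(growth, model_flux)]

# Random 48/32 split into training and validation sets.
idx = list(range(80))
rng.shuffle(idx)
train, valid = idx[:48], idx[48:]
rows = [[1.0, growth[i], model_flux[i]] for i in range(80)]
coef = fit_ols([rows[i] for i in train], [measured[i] for i in train])
corrected = predict(coef, [rows[i] for i in valid])
```

In this toy setting the measured rate is an exact linear function of the two inputs, so the validation predictions match the measurements. With real data, one such regression would presumably be fitted per amino acid and judged by the validation fold change error.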
These results show that our hybrid modeling approach estimates amino acid consumption well for a small timescale of 1 day, when the steady state assumption holds true. This assumption is not valid at larger timescales of multiple days, where nutrient consumption declines asymptotically as cellular metabolism shifts from exponential growth phase to stationary phase. However, we found this limitation could be remedied by modeling the multi-phase decline in amino acid consumption with a simple sigmoid function (Methods, equation 3; line in Fig. 4), which can be fitted from only a few datapoints. Specifically, we further adapted our hybrid modeling approach by first predicting amino acid consumption rates of several early culture days as described above. Then, these datapoints were used to fit a sigmoid function that described the entire consumption profile, including later culture days (Fig. 4A). Using this approach, we accurately predicted the time-course consumption rates of 13 out of 18 amino acids (Spearman ρ > 0.65; Fig. 4B), with only a few amino acids remaining difficult to predict (alanine, glycine, and histidine). Notably, our approach accurately predicted the consumption profiles of amino acids that are highly abundant in recombinant antibodies (e.g. serine, valine, and leucine) (Fan et al., 2015), or that complicate media formulation due to low solubility (e.g. tyrosine). These results highlight the method’s value in monitoring and forecasting the bioreactor environment.
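The forecasting step can be sketched as follows. Equation 3 of the Methods is not reproduced in this section, so the three-parameter declining logistic used here is an assumption, as are the function names, the parameter grids, and the noiseless toy profile. A coarse grid search with a closed-form amplitude stands in for whatever fitting routine the authors used.

```python
import math

def logistic_decline(t, q_max, k, t0):
    """Consumption rate that declines from q_max toward zero around day t0 (assumed form)."""
    return q_max / (1.0 + math.exp(k * (t - t0)))

def fit_logistic(days, rates, k_grid, t0_grid):
    """Grid search over steepness k and midpoint t0; q_max is solved by least squares."""
    best = None
    for k in k_grid:
        for t0 in t0_grid:
            s = [1.0 / (1.0 + math.exp(k * (t - t0))) for t in days]
            q_max = sum(r * v for r, v in zip(rates, s)) / sum(v * v for v in s)
            sse = sum((r - q_max * v) ** 2 for r, v in zip(rates, s))
            if best is None or sse < best[0]:
                best = (sse, q_max, k, t0)
    return best[1:]

def spearman(a, b):
    """Spearman rank correlation; this simple form assumes no tied values."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical profile over culture days 4 to 11; only days 4 to 7 are used for fitting.
days = list(range(4, 12))
truth = [logistic_decline(t, 2.0, 1.0, 8.0) for t in days]
k_grid = [0.25 * i for i in range(1, 13)]
t0_grid = [6.0 + 0.5 * i for i in range(13)]
q_max, k, t0 = fit_logistic(days[:4], truth[:4], k_grid, t0_grid)
forecast = [logistic_decline(t, q_max, k, t0) for t in days]
rho = spearman(forecast, truth)
```

In practice the early-day rates would come from the corrected hybrid predictions and would be noisy, so the forecast would not match the later days exactly. The rank correlation between forecast and measured profiles is then the kind of score reported in the text.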
In summary, the presented modeling workflow forecasted the entire amino acid consumption profile from early bioprocess measurements, facilitating anticipatory and in situ control of bioreactor nutrient availability. This was realized by a novel combination of metabolic and statistical models. A metabolic network model estimated amino acid uptake rates necessary for observed proliferation and productivity, assuming an ideally efficient metabolism and steady state conditions. Two subsequent regression models refined these predictions by offsetting prediction errors empirically and by describing the time-course relationship of individual predictions. Our efforts are part of a growing trend of synergizing metabolic network models with machine learning methods (Zampieri et al., 2019), and demonstrate the power of hybrid modeling for on-line control of bioprocesses.