Figure 3. Projected changes in global mean temperature (top row) and energy balance at the TOA (N) (bottom row). Each panel shows changes in the AOGCM (x-axis) against the EBM2 emulation (y-axis). Each point represents an annual mean during 1915-2014.

3.4 Future near-surface temperature projections

We compare temperature emulations for the twenty-first century from EBM2 based on the different methods for calibrating λ and ε (Figure 4). Results are shown for the five of the eight AOGCMs for which the most complete CMIP6 data are available. Results for other models and experiments are shown in Figure S4.
The abrupt-4xCO2 calibration performs very differently across the AOGCMs and is typically worse than the step model (Figure S4). For four of the AOGCMs, the emulations of SSP2-4.5 deteriorate during the twenty-first century. The emulation errors are correlated with the magnitude of the forcing: they peak near the end of the twenty-first century for total and GHG forcing and early in the twenty-first century for aerosol forcing. The exception is MIROC6, for which the abrupt-4xCO2 calibrated EBM2 performs well throughout 1850-2100 and across the three simulations. For NorESM2-LM, SSP2-4.5 is relatively closely emulated but SSP2-4.5-AER is not. Optimization of the λ and ε parameters (the “1850-2100” calibration in Figure 4) yielded close emulations for all of the AOGCMs and across the three experiments. Similarly close emulations were also achieved by minimizing the RMSE over 2015-2100 (not shown). Minimizing the RMSE for the later years of the projection, when the temperature anomalies are largest, is key.
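The calibration step described above can be illustrated with a minimal sketch. It assumes a commonly used two-layer energy balance form with an efficacy factor ε, positive λ for a stabilizing feedback, annual forward-Euler time stepping, and a brute-force grid search over λ and ε; the EBM2 formulation and optimizer actually used in this study may differ. The function names and the default values of C, C0 and gamma are hypothetical placeholders, not values from the paper.

```python
import math

def ebm2(forcing, lam, eps, C=8.0, C0=100.0, gamma=0.7):
    """Toy two-layer EBM (assumed form), one forward-Euler step per year.

    forcing: annual ERF values (W m-2); lam: feedback parameter (W m-2 K-1,
    positive = stabilizing); eps: efficacy of deep-ocean heat uptake.
    C, C0 (W yr m-2 K-1) and gamma (W m-2 K-1) are illustrative defaults.
    Returns the surface temperature anomaly (K) after each annual step.
    """
    T, T0 = 0.0, 0.0
    temps = []
    for F in forcing:
        exchange = gamma * (T - T0)            # heat flux into the deep layer
        dT = (F - lam * T - eps * exchange) / C
        dT0 = exchange / C0
        T, T0 = T + dT, T0 + dT0
        temps.append(T)
    return temps

def rmse(a, b, start=0, stop=None):
    """Root-mean-square error over the index window [start, stop)."""
    a, b = a[start:stop], b[start:stop]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def calibrate(forcing, target, lam_grid, eps_grid, start=0, stop=None):
    """Grid search for the (lam, eps) pair minimizing RMSE in the window.

    Returns a tuple (best_rmse, best_lam, best_eps).
    """
    best = None
    for lam in lam_grid:
        for eps in eps_grid:
            err = rmse(ebm2(forcing, lam, eps), target, start, stop)
            if best is None or err < best[0]:
                best = (err, lam, eps)
    return best
```

With annual series starting in 1850, passing `start=165` would restrict the fit to 2015 onwards, and `stop=165` would restrict it to 1850-2014, which mirrors the calibration windows compared in Figure 4.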
The “1850-2014” calibration yields a close emulation of temperatures to 2014 but errors increase strongly after the calibration period. Extending the calibration period from 1850-2014 to 1850-2040 (not shown) does improve the emulation to 2040 but not always after 2040. Importantly, it does not mitigate the risk of large emulation errors outside the calibration period and its impact varies greatly between AOGCMs and between different experiments for the same AOGCM.
To investigate the impact of using a calibration from one experiment for a different experiment, the “1850-2100” calibration from SSP2-4.5 was applied to the SSP2-4.5-GHG and SSP2-4.5-AER experiments (the “SSP2-4.5” calibration in Figure 4). For both SSP2-4.5-GHG and SSP2-4.5-AER, the error for the “SSP2-4.5” calibration is greater than for the “1850-2100” calibration. Both the size of the impact and its temporal behaviour vary between models and experiments. For CanESM5, for instance, the difference in temperature emulation is evident early in the twentieth century for SSP2-4.5-AER but only early in the twenty-first century for SSP2-4.5-GHG. Bespoke parameter calibrations for different ERF scenarios are therefore necessary to achieve close emulations throughout 1850-2100. This result is important because it demonstrates that emulator performance can be poor for out-of-sample predictions, yet there is no clear a priori way to know whether this will be the case. This poses a problem, since the value of emulators lies in their use for creating out-of-sample scenarios for which AOGCM simulations do not exist and cannot be readily performed.
The average of the emulations for individual models (Figure 4 “Ensemble mean”) has relatively small RMSEs (except for the 1850-2014 calibration). This is due, in part, to averaging of interannual variability across the ensemble of emulations. Further, the ensemble mean generally has smaller RMSEs than an emulation in which the ensemble mean ERF is used to emulate the ensemble temperature projection (Figure 4 “Ensemble emulation”).
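The distinction between the two ensemble quantities compared above can be made explicit with a short sketch. It assumes that each model has its own ERF series and calibrated parameters, and that the “Ensemble emulation” uses a separately calibrated parameter set; the function names and the `emulate` callable are hypothetical and stand in for whichever emulator is used.

```python
def mean_series(series_list):
    """Year-by-year mean across equally long series."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]

def ensemble_two_ways(forcings, params, ensemble_params, emulate):
    """Return (mean of emulations, emulation of the mean forcing).

    forcings: one ERF series per model.
    params: one parameter tuple per model, e.g. (lam, eps).
    ensemble_params: parameter tuple calibrated for the ensemble as a whole.
    emulate: callable emulate(forcing, *params) -> temperature series.
    """
    # "Ensemble mean": emulate each model separately, then average.
    per_model = [emulate(f, *p) for f, p in zip(forcings, params)]
    mean_of_emulations = mean_series(per_model)
    # "Ensemble emulation": a single run driven by the ensemble-mean ERF.
    emulation_of_mean = emulate(mean_series(forcings), *ensemble_params)
    return mean_of_emulations, emulation_of_mean
```

The two results coincide only in special cases, for example when the emulator is linear in the forcing and all models share the same parameters; otherwise they generally differ.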
Finally, while the optimization method yields unique parameter solutions, there is a near-linear trade-off between the λ and ε parameters when minimizing the RMSE (Figure S5). For the same RMSE, there are solutions with a strong feedback (λ) and a weak pattern effect (ε), and solutions with a weak feedback and a strong pattern effect. This shows that optimized values for the λ and ε parameters may not be robust estimates of climate feedback or the AOGCM pattern effect.
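One possible reading of this trade-off, offered as a sketch and not as the authors' derivation, follows from a commonly used two-layer form of the energy balance model. The equations below are an assumption (this section does not state the EBM2 equations), with λ > 0 taken as stabilizing, γ as the heat exchange coefficient between the layers, and T_0 as the deep-ocean temperature anomaly.

```latex
% Assumed two-layer form (not stated in this section).
\begin{align}
C \frac{dT}{dt} &= F - \lambda T - \varepsilon\gamma\,(T - T_0), \\
C_0 \frac{dT_0}{dt} &= \gamma\,(T - T_0).
\end{align}
% Quasi-equilibrium of the upper layer (C\,dT/dt \approx 0), with r = T_0/T:
\begin{equation}
T \approx \frac{F}{\lambda + \varepsilon\gamma\,(1 - r)} .
\end{equation}
% Parameter pairs on the line below give nearly the same T
% while r stays close to a typical value \bar{r}:
\begin{equation}
\lambda + \varepsilon\gamma\,(1 - \bar{r}) = \text{const}
\quad\Longrightarrow\quad
\frac{d\lambda}{d\varepsilon} \approx -\gamma\,(1 - \bar{r}) .
\end{equation}
```

Under these assumptions, the surface temperature constrains mainly the combination λ + εγ(1 − r) and not λ and ε separately. Because r changes slowly as the deep ocean warms, the degeneracy is only approximate, which would be consistent with a unique optimum lying along a near-linear ridge of similar RMSE.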