1 Introduction

Climate model emulators are simplified physical or statistical models that are computationally efficient. They played a central role in producing future global near-surface temperature projections for the Working Group I Sixth Assessment Report (Forster et al. 2021; Lee et al. 2021) of the Intergovernmental Panel on Climate Change (IPCC AR6). The IPCC AR6 used climate model emulators to supplement simulations from coupled atmosphere-ocean general circulation models (AOGCMs), both by extending available simulations further into the future and by projecting future climate scenarios not available from AOGCMs. It is important, therefore, that the simplifying assumptions used by emulators are rigorously tested so that the robustness of their performance is understood.
Physically based climate model emulators, such as energy balance models (EBMs), use bulk physical relationships to emulate the large-scale behavior of Earth’s climate system. For example, EBMs were used by Colman and Soldatenko (2020) to investigate links between climate variability and climate sensitivity, and by Modak and Mauritsen (2021) to investigate the probability of occurrence of the 2000-2012 global warming hiatus.
Two-layer EBMs produce close emulations of idealized abrupt-4xCO2 and 1pctCO2 simulations from AOGCMs (e.g., “EBM-ε” in Geoffroy et al. 2013b; “held-two-layer-uom” in Nicholls et al. 2020). Differences between emulations and AOGCM projections are generally greatest at times of pronounced change in the rate of temperature increase. Such changes are associated with time-varying feedbacks (Senior and Mitchell, 2000; Winton et al., 2010; Armour et al., 2013; Dong et al., 2020; Dunne et al., 2020; Rugenstein et al., 2020; Dong et al., 2021), which are caused by evolving spatial pattern effects in surface temperature (Stevens 2016; Andrews et al., 2015; Rugenstein et al., 2016; Dong et al., 2021) and non-linear state dependences in climate feedbacks (Good et al., 2015; Rohrschneider et al., 2019; Bloch-Johnson et al., 2021). EBMs have been enhanced to capture time-varying feedbacks: the Geoffroy et al. (2013b) EBM includes an efficacy parameter for deep ocean heat uptake, and the “held-two-layer-uom” EBM also includes a state-dependent feedback parameter (Rohrschneider et al., 2019; Nicholls et al., 2020). These enhancements, however, do not precisely capture the feedback changes in AOGCMs. The remaining mismatch contributes to model structural error, which is irreducible unless the EBM structure itself is extended (e.g., from a two-layer EBM to three or more layers; Cummins et al., 2020).
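To make the two-layer structure concrete, the following is a minimal sketch of a two-layer EBM with a deep ocean heat uptake efficacy, integrated with a simple forward Euler scheme at annual steps. The governing equations follow the standard form of this model class; the function name, the integration scheme, and all parameter values (feedback parameter lam, heat exchange coefficient gamma, efficacy eps, and the two heat capacities) are illustrative assumptions and are not calibrated values from any AOGCM.

```python
def two_layer_ebm(forcing, lam=1.2, gamma=0.7, eps=1.3,
                  c_upper=8.0, c_deep=100.0, dt=1.0):
    """Integrate a two-layer EBM with forward Euler steps of length dt (years).

    Upper layer: c_upper * dT/dt  = F - lam * T - eps * gamma * (T - T0)
    Deep layer:  c_deep  * dT0/dt = gamma * (T - T0)

    forcing       : sequence of radiative forcing values (W m-2), one per step
    lam, gamma    : W m-2 K-1 (illustrative values, not calibrated)
    eps           : dimensionless efficacy of deep ocean heat uptake
    c_upper/c_deep: heat capacities in W yr m-2 K-1 (illustrative values)
    """
    t_upper, t_deep = 0.0, 0.0
    upper_series, deep_series = [], []
    for f in forcing:
        heat_uptake = gamma * (t_upper - t_deep)
        # Both layers are updated from the state at the start of the step.
        t_upper, t_deep = (
            t_upper + dt * (f - lam * t_upper - eps * heat_uptake) / c_upper,
            t_deep + dt * heat_uptake / c_deep,
        )
        upper_series.append(t_upper)
        deep_series.append(t_deep)
    return upper_series, deep_series


# Hypothetical abrupt step forcing held constant for 150 years.
step_forcing = [7.4] * 150
t_upper, t_deep = two_layer_ebm(step_forcing)
equilibrium = step_forcing[0] / 1.2  # F / lam for the default lam
```

With eps = 1 the sketch reduces to the two-layer model without efficacy. In this formulation, eps greater than 1 makes the effective feedback vary in time under transient forcing, which is the behavior the efficacy parameter is intended to approximate.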
Assessments of emulator performance are more trustworthy when projections are validated using data different from those used to calibrate the model parameters (out-of-sample validation). EBM parameters are frequently calibrated using idealized step-forcing experiments (e.g., abrupt-4xCO2), with the parameters estimated using analytical methods (Geoffroy et al., 2013a) or statistical methods (e.g., Cummins et al., 2020). The Coupled Model Intercomparison Project Phase 6 (CMIP6) (Eyring et al. 2016) historical and future shared socio-economic pathway (SSP) projections for AOGCMs are therefore well suited for assessing EBM emulator performance, because they allow out-of-sample assessments using realistic climate scenarios. Although climate model emulators have been evaluated (e.g., Nicholls et al., 2020; Nicholls et al., 2021), it is not known how well emulators perform for the latest CMIP6 AOGCMs when evaluated using realistic, out-of-sample climate projections and the latest assessments of effective radiative forcing (ERF). Furthermore, the contribution of irreducible model structural errors to total prediction error remains poorly understood.
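The distinction between in-sample and out-of-sample error can be illustrated with a small sketch. It calibrates two parameters of a two-layer EBM against a step-forcing series and then scores the calibrated model on a different, ramp-forcing series. Everything here is an assumption made for illustration: the "target" series are synthetic (generated from the same EBM with added noise, not AOGCM output), the forcing values are hypothetical, and the brute-force grid search stands in for the analytical and statistical calibration methods cited above.

```python
import random


def ebm2(forcing, lam, gamma, eps=1.0, c_upper=8.0, c_deep=100.0):
    """Upper-layer temperature from a two-layer EBM (forward Euler, annual steps)."""
    t_upper, t_deep, out = 0.0, 0.0, []
    for f in forcing:
        uptake = gamma * (t_upper - t_deep)
        t_upper, t_deep = (
            t_upper + (f - lam * t_upper - eps * uptake) / c_upper,
            t_deep + uptake / c_deep,
        )
        out.append(t_upper)
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length series."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


# Synthetic stand-in for AOGCM output: known parameters plus Gaussian noise.
rng = random.Random(0)
TRUE_LAM, TRUE_GAMMA = 1.1, 0.6                     # hypothetical values
step_forcing = [7.4] * 150                          # hypothetical step forcing
ramp_forcing = [3.7 * yr / 70 for yr in range(1, 141)]  # hypothetical linear ramp
target_step = [t + rng.gauss(0, 0.1)
               for t in ebm2(step_forcing, TRUE_LAM, TRUE_GAMMA)]
target_ramp = [t + rng.gauss(0, 0.1)
               for t in ebm2(ramp_forcing, TRUE_LAM, TRUE_GAMMA)]

# Calibration: grid search over (lam, gamma) using the step experiment only.
grid = [0.4 + 0.05 * k for k in range(33)]          # 0.4 to 2.0 W m-2 K-1
lam_fit, gamma_fit = min(
    ((lam, gamma) for lam in grid for gamma in grid),
    key=lambda p: rmse(ebm2(step_forcing, *p), target_step),
)

# Validation: score on the calibration data and on the unseen ramp scenario.
in_sample = rmse(ebm2(step_forcing, lam_fit, gamma_fit), target_step)
out_of_sample = rmse(ebm2(ramp_forcing, lam_fit, gamma_fit), target_ramp)
```

Because the synthetic target shares the EBM's structure, the out-of-sample error in this sketch reflects only noise and parameter uncertainty. When the target is a real AOGCM, any additional out-of-sample error that calibration cannot remove would be attributable to structural differences between the emulator and the AOGCM.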
In this study, we evaluate the performance of a two-layer energy balance model (EBM2) (Held et al. 2010; Geoffroy et al. 2013a, b) for emulating CMIP6 historical and future temperature trends using different EBM calibrations. We calibrate the EBM2 parameters for specific periods and ERFs and evaluate the temperature projections for subsequent periods and alternative ERF scenarios. EBM2 is benchmarked against an impulse-response step model and a three-layer EBM.