Figure 2. Distribution of vertical levels in various SHiELD configurations for a surface pressure of 1000 hPa and a standard atmospheric temperature structure.
The simulation characteristics and prediction skill of SHiELD have been previously discussed in several papers and will not be repeated here. Improving predictions of tropical cyclone track, intensity, and genesis has been a key driver of SHiELD development: Chen et al. (2019a) describe the 2016 and 2017 versions, while the considerably improved 2018 version is described in Chen et al. (2019b). Most notably, SHiELD greatly improves upon other global models' ability to predict tropical cyclone intensity. The large-scale prediction skill, and the CONUS precipitation and 2-m temperature skill, are briefly described for the 2016 and 2017 versions in Zhou et al. (2019) and Harris et al. (2019).
The anomaly correlation coefficient (ACC) of the 500-mb geopotential height field is the standard means for evaluating the large-scale prediction skill of medium-range prediction models. Figure 3 (top) shows that the global ACC of SHiELD has been better at all lead times than the contemporary GFS since the 2017 version, and significantly so on days 1–6. At all lead times except for days 7 and 8, each new version has improved upon the previous version. The result for root-mean-square error (RMSE; Figure 3, bottom) is even more striking: every version is an improvement upon the previous at every lead time, and both the 2018 and 2019 versions are significantly better than the operational GFS. Results for just the northern hemisphere (20°N–80°N, Supplemental Figure S1) are less dramatic, but SHiELD still shows statistically significant improvements in ACC and RMSE out to day 5. The GFS and all versions of SHiELD reach an ACC of 0.6 at 8.3–8.5 days globally and 8.5–8.7 days in the northern hemisphere, with some year-to-year and version-to-version variability.
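For readers unfamiliar with these metrics, the following Python sketch shows one common way to compute an area-weighted ACC and RMSE on a regular latitude-longitude grid, and to estimate the lead time at which the ACC falls to 0.6. It is illustrative only: the centered form of the ACC, the cosine-latitude weighting, the choice of climatology, and all function names here are assumptions, not the verification code used to produce Figure 3.

```python
import numpy as np


def area_weights(lat_deg, nlon):
    """cos(latitude) weights on a (nlat, nlon) grid, normalized to sum to 1."""
    w = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))[:, None]
    w = w * np.ones((1, nlon))
    return w / w.sum()


def anomaly_correlation(forecast, analysis, climatology, weights):
    """Centered, area-weighted ACC between forecast and verifying analysis.

    All fields are 2-D arrays (nlat, nlon), e.g. 500-mb geopotential height.
    """
    fa = forecast - climatology          # forecast anomaly
    aa = analysis - climatology          # verifying (analysis) anomaly
    fa = fa - np.sum(weights * fa)       # remove area-mean anomaly (centered form)
    aa = aa - np.sum(weights * aa)
    cov = np.sum(weights * fa * aa)
    return cov / np.sqrt(np.sum(weights * fa**2) * np.sum(weights * aa**2))


def rmse(forecast, analysis, weights):
    """Area-weighted root-mean-square error of the forecast."""
    return np.sqrt(np.sum(weights * (forecast - analysis) ** 2))


def lead_time_at_threshold(lead_days, acc_values, threshold=0.6):
    """Lead time at which ACC first drops below `threshold`.

    Uses linear interpolation between the two bracketing lead times and
    returns None if the ACC never falls below the threshold.
    """
    lead_days = np.asarray(lead_days, dtype=float)
    acc_values = np.asarray(acc_values, dtype=float)
    below = np.nonzero(acc_values < threshold)[0]
    if below.size == 0:
        return None
    i = below[0]
    if i == 0:
        return float(lead_days[0])
    t0, t1 = lead_days[i - 1], lead_days[i]
    a0, a1 = acc_values[i - 1], acc_values[i]
    return float(t0 + (a0 - threshold) * (t1 - t0) / (a0 - a1))
```

In practice the ACC and RMSE would be computed for each forecast at each lead time and then averaged over all initialization dates; a hemispheric score such as the 20°N–80°N result would restrict the fields and weights to that latitude band before calling these functions.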
The time series of day-5 global ACC and RMSE (Figure 4) shows that while there is a general secular improvement in both SHiELD and the GFS, there can be large seasonal and even interannual variability in forecast skill. Usually, predictions are more skillful in northern winter, as strong synoptic forcing dominates the large-scale weather patterns, but some northern summers see little to no forecast degradation. The implementation of GFSv13 on 11 May 2016, which included a major upgrade to the data assimilation cycling system of the GFS, significantly reduced RMSE in May and June 2016 compared with the preceding four months of the year. These results are worthy of further investigation. We do conclude, however, that it may be misleading to use a short time period to evaluate or compare global prediction models.
The time evolution of the large-scale forecast skill is very similar for the GFS and SHiELD on monthly and shorter time scales. This is expected, as they use identical initial conditions, and SHiELD benefits from continual upgrades of the GFS initial conditions. As discussed in Chen et al. (2019b), the quality of the initial conditions is the preeminent factor in determining the forecast skill for the large-scale circulation, as well as for metrics such as hurricane track forecasts that depend closely on the prediction skill of the large-scale flow.
These results are for hindcasts, but the ACC and RMSE for our real-time forecasts are nearly identical. An important caveat is that the operational GFS supports nearly the entire NCEP modeling suite, and so the GFS has many more demands and a much more stringent evaluation process imposed upon its development than does SHiELD. The development cycle of the GFS will therefore necessarily be less rapid and more methodical than that of SHiELD. Conversely, an experimental research model like SHiELD has the freedom to pursue many different avenues for model development ("failure is always an option") so that the most successful new ideas can later be transitioned into operations, a major goal of the UFS.