In this paper, we introduce a testbed for evaluating and comparing climate modeling systems at cloud-resolving scales using hindcasts of the June 2012 North American derecho. The testbed is applied to two models: the regionally refined Simple Cloud-Resolving E3SM Atmosphere Model (SCREAM) at horizontal resolutions ranging from 6.5 to 1.625 km, and the Weather Research and Forecasting (WRF) model at 4 km grid spacing. We find the simulation results to be highly sensitive to the initial conditions, initialization time, and model configuration, with initial conditions from the Rapid Refresh (RAP) producing the best simulation. The SCREAM simulations improve significantly as the horizontal grid spacing is refined. Although both models exhibit a propagation delay of approximately 2 hours, SCREAM at 1.625 km captures the observed bow echo structure of the derecho well and predicts strong surface gusts exceeding 30 m/s. In comparison, WRF rarely produces surface winds over 25 m/s, and the derecho wind gusts in WRF are 42-46\% weaker than in SCREAM. Moreover, WRF has a lower bias in simulating cold clouds but overestimates the precipitation intensity. Both models reproduce the observed spatial patterns of outgoing longwave radiation well (Pearson correlation > 0.88), but they overestimate the area of composite radar reflectivity > 40 dBZ by up to a factor of 4 and underestimate the precipitating area, by ~70\% in WRF and 47\% in SCREAM, relative to observations.