Despite the proliferation of computer-based research on hydrology and water resources, such research is typically poorly reproducible. Published studies have low reproducibility due to incomplete availability of data and computer code, and a lack of documentation of workflow processes. This leads to a lack of transparency and efficiency because existing code can neither be quality controlled nor re-used. Given the commonalities between existing process-based hydrological models in terms of their required input data and preprocessing steps, open sharing of code can lead to large efficiency gains for the modeling community. Here we present a model configuration workflow that provides full reproducibility of the resulting model instantiations in a way that separates the model-agnostic preprocessing of specific datasets from the model-specific requirements that models impose on their input files. We use this workflow to create large-domain (global, continental) and local configurations of the Structure for Unifying Multiple Modeling Alternatives (SUMMA) hydrologic model connected to the mizuRoute routing model. These examples show how a relatively complex model setup over a large domain can be organized in a reproducible and structured way that has the potential to accelerate advances in hydrologic modeling for the community as a whole. We provide a tentative blueprint of how community modeling initiatives can be built on top of workflows such as this. We term our workflow the “Community Workflows to Advance Reproducibility in Hydrologic Modeling” (CWARHM; pronounced “swarm”).
Machine learning (ML) based models have demonstrated very strong predictive capabilities for hydrologic modeling, but are often criticized for being black boxes. In this paper we use a technique from the field of explainable AI (XAI), called layerwise relevance propagation (LRP), to “open the black box”. Specifically, we train a deep neural network on data from a set of hydroclimatically diverse FluxNet sites to predict turbulent heat fluxes, and then use the LRP technique to analyze what it learned. We show that the neural network learns physically plausible relationships, including different ways of partitioning the turbulent heat fluxes according to moisture or energy limiting characteristics of the sites. That is, the neural network learns different behaviors at arid and non-arid sites. We also develop and demonstrate a novel technique that uses the output of the LRP analysis to explore how the neural network learned to regionalize between sites. We find that the neural network primarily learned behaviors that differed between evergreen forested sites and all other vegetation classes. Our analysis shows that even simple neural networks can extract physically plausible relationships and that by using XAI methods we can learn new information from the ML-based methods.
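To make the idea of relevance propagation concrete, the following is a minimal sketch of one common LRP variant (the epsilon rule) for a small fully connected ReLU network. It is an illustration only: the abstract does not state which LRP rule or network architecture was used, and the function names `forward` and `lrp_epsilon`, the `eps` stabilizer, and the dense-layer layout are assumptions made here.

```python
import numpy as np

def forward(x, weights, biases):
    """Return the activations of every layer (input included).

    Assumes a dense network with ReLU hidden layers and a linear output.
    """
    activations = [x]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = activations[-1] @ W + b
        is_last = i == len(weights) - 1
        activations.append(z if is_last else np.maximum(z, 0.0))
    return activations

def lrp_epsilon(x, weights, biases, eps=1e-6):
    """Redistribute the network output onto the inputs (LRP-epsilon rule)."""
    activations = forward(x, weights, biases)
    relevance = activations[-1]  # start from the output itself
    layers = zip(reversed(weights), reversed(biases), reversed(activations[:-1]))
    for W, b, a in layers:
        z = a @ W + b                               # pre-activations of this layer
        z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabilizer avoids division by zero
        s = relevance / z                           # relevance per unit of pre-activation
        relevance = a * (s @ W.T)                   # share assigned to each lower-layer unit
    return relevance
```

With zero biases the input relevances sum (approximately) to the network output, which is the conservation property that makes the per-input scores interpretable as contributions.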
Deep learning (DL) methods have shown great promise for accurately predicting hydrologic processes but have not yet reached the complexity of traditional process-based hydrologic models (PBHM) in terms of representing the entire hydrologic cycle. The ability of PBHMs to simulate the hydrologic cycle makes them useful for a wide range of modeling and simulation tasks, for which DL methods have not yet been adapted. We argue that we can take advantage of each of these approaches by coupling DL methods into PBHMs as individual process parameterizations. We demonstrate that this is viable by developing DL process parameterizations for turbulent heat fluxes and coupling them into the Structure for Unifying Multiple Modeling Alternatives (SUMMA), a modular PBHM modeling framework. We developed two DL parameterizations and integrated them into SUMMA, resulting in a one-way coupled implementation (NN1W), which relies only on model inputs, and a two-way coupled implementation (NN2W), which also incorporates SUMMA-derived model states. Our results demonstrate that the DL parameterizations are able to outperform calibrated standalone SUMMA benchmark simulations. Further, we demonstrate that the two-way coupling can simulate the long-term latent heat flux better than the standalone benchmark. This shows that DL methods can benefit from PBHM information, and the synergy between these modeling approaches is superior to either approach individually.
Water resources planning often uses streamflow predictions made by hydrologic models. These simulated predictions have systematic errors which limit their usefulness as input to water management models. To account for these errors, streamflow predictions are bias-corrected through statistical methods which adjust model predictions based on comparisons to reference datasets (such as observed streamflow). Existing bias-correction methods have several shortcomings when used to correct spatially-distributed streamflow predictions. First, existing bias-correction methods destroy the spatio-temporal consistency of the streamflow predictions when these methods are applied independently at multiple sites across a river network. Second, bias-correction techniques are usually built on simple, time-invariant mappings between reference and simulated streamflow without accounting for the hydrologic processes which underpin the systematic errors. We describe improved bias-correction techniques which account for the river network topology and which allow for corrections that are process-conditioned. Further, we present a workflow that allows the user to select whether to apply these techniques separately or in conjunction. We evaluate four different bias-correction methods implemented with our workflow in the Yakima River Basin in the Pacific Northwestern United States. We find that all four methods reduce systematic bias in the simulated streamflow. The spatially-consistent bias-correction methods produce spatially-distributed streamflow as well as bias-corrected incremental streamflow, which is suitable for input to water management models. We also find that the process-conditioning methods improve the timing of the corrected streamflow when conditioned on daily minimum temperature, which we use as a proxy for snowmelt processes.
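As an illustration of what a time-invariant mapping and a process-conditioned mapping can look like, the sketch below uses empirical quantile mapping, a widely used bias-correction technique, and a crude conditioning step that fits separate mappings on either side of a threshold in a conditioning variable (for example, daily minimum temperature). This is not the method evaluated in the abstract, which does not specify its mapping; the function names, the single-threshold split, and the interpolation choices are all assumptions.

```python
import numpy as np

def quantile_map(sim_train, ref_train, sim_new):
    """Empirical quantile mapping.

    Each new simulated value is assigned its non-exceedance probability in the
    training simulation, then replaced by the reference value with the same
    probability.
    """
    sim_sorted = np.sort(sim_train)
    ref_sorted = np.sort(ref_train)
    probs_sim = np.linspace(0.0, 1.0, len(sim_sorted))
    probs_ref = np.linspace(0.0, 1.0, len(ref_sorted))
    p = np.interp(sim_new, sim_sorted, probs_sim)
    return np.interp(p, probs_ref, ref_sorted)

def conditioned_quantile_map(sim_train, ref_train, cond_train,
                             sim_new, cond_new, threshold):
    """Fit one mapping below `threshold` and one at or above it.

    Assumes both conditions occur in the training data.
    """
    corrected = np.empty_like(sim_new, dtype=float)
    for below in (True, False):
        train_mask = (cond_train < threshold) == below
        new_mask = (cond_new < threshold) == below
        if new_mask.any():
            corrected[new_mask] = quantile_map(
                sim_train[train_mask], ref_train[train_mask], sim_new[new_mask]
            )
    return corrected
```

Applying `quantile_map` independently at every gauge is exactly the kind of site-by-site correction that can break consistency along a river network, which is why the abstract pairs the conditioning with a network-aware step.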
The hydrology community is engaged in an intense debate regarding the merits of machine learning (ML) models over traditional models. These traditional models include both conceptual and process-based hydrological models (PBHMs). Many in the hydrologic community remain skeptical about the use of ML models, because they consider these models “black-box” constructs that do not allow for a direct mapping between model internals and hydrologic states. In addition, they argue that it is unclear how to encode a priori hydrological expertise into ML models. Yet at the same time, ML models now routinely outperform traditional hydrological models for tasks such as streamflow simulation and short-range forecasting. Not only that, they are demonstrably better at generalizing runoff behavior across sites and therefore better at making predictions in ungauged basins, a long-standing problem in hydrology. In recent model experiments, we have shown that ML turbulent heat flux parameterizations embedded in a PBHM outperform the process-based parameterization in that PBHM. In this case, the PBHM enforced energy and mass constraints, while the ML parameterization calculated the heat fluxes. While this approach provides an interesting proof-of-concept and perhaps acts as a bridge between traditional models and ML models, we argue that it is time to take a bigger leap than incrementally improving the existing generation of models. We need to construct a new generation of hydrologic and land surface models (LSMs) that takes advantage of ML technologies in which we directly encode the physical concepts and constraints that we know are important, while being able to flexibly ingest a wide variety of data sources directly. To be employed as LSMs in coupled earth system models, they will need to conserve mass and energy. These new models will take time to develop, but the time to start is now, since the basic building blocks exist and we know how to get started. 
If nothing else, it will advance the debate and undoubtedly lead to better understanding within the hydrology and land surface communities regarding the merits and demerits of the competing approaches. In this presentation, we will discuss some of these early studies, illustrate how ML models can offer hydrologic insight, and argue the case for the development of ML-based LSMs.
While machine learning (ML) techniques have proven to have exceptional performance in prediction of variables that have long and varied observational records, it is not clear how to use such techniques to learn about intermediate processes which may not be readily observable. We build on previous work that found that encoding either known or approximated physical relationships into the machine learning framework can allow the learned model to implicitly represent processes that are not directly observed, but can be related to an observable quantity. Zhao et al. (2019) found that an artificial neural network with an encoded Penman-Monteith-like equation for latent heat could reliably predict the latent heat while providing an estimate of the resistance term, which is not readily observable at the landscape scale. We advance this framework in two ways. First, we expand the physics-based layer to include the partitioning of both the latent and sensible heat fluxes among the vegetation and soil domains, each with their own resistance terms. Second, we couple the ML model to a land-surface model (LSM), which provides information from simulated processes to the ML model. We thus both provide the ML model with physics-informed inputs and maintain constraints such as mass and energy balance on the outputs of the coupled ML-LSM simulations. Previously we found that coupling an LSM to the ML model could provide good predictions of bulk turbulent heat fluxes, and in this work we explore how incorporating the additional physics-based partitioning allows the model to learn more ecohydrologically-relevant dynamics in diverse biomes. Further, we explore what the model learned in predicting the unobserved resistance terms and what we can learn from the model itself.
Zhao, W. L., Gentine, P., Reichstein, M., Zhang, Y., Zhou, S., Wen, Y., et al. (2019). Physics-Constrained Machine Learning of Evapotranspiration. Geophysical Research Letters, 46(24), 14496–14507. https://doi.org/10.1029/2019GL085291
Machine learning techniques have proven useful at predicting many variables of hydrologic interest, and often outperform traditional models for univariate predictions. However, multivariate-output deep learning models have not had the same success as the univariate case in the hydrologic sciences. Multivariate prediction is a clear area where machine learning still lags behind traditional process-based modeling efforts. Reasons for this include the lack of coincident data from multiple variables, which makes it difficult to train multivariate deep-learning models, as well as the need to capture inter-variable covariances and satisfy physical constraints. For these reasons process-based hydrologic models are still used to simulate and make predictions for entire hydrologic systems. Therefore, we anticipate that future state-of-the-art hydrologic models will couple machine learning with process-based representations in a way that satisfies physical constraints and allows for a blending of theoretical and data-driven approaches as they are most appropriate. In this presentation we will demonstrate that it is possible to train deep learning models to represent individual processes, forming an effective process parameterization that can be directly coupled with a physically based hydrologic model. We will develop a deep-learning representation of latent heat and couple it to a mass and energy balance conserving hydrologic model. We will demonstrate its performance characteristics compared to traditional methods of predicting latent heat. We will also compare how incorporation of this deep learning representation affects other major states and fluxes internal to the hydrologic model.
The hydrologic cycle is a complex and dynamic system of interacting processes. Hydrologists seeking to understand and predict these systems develop models of varying complexity, and compare their output to observations to evaluate their performance or diagnose shortcomings within the models. Often, these analyses take into account only single variables or isolated aspects of the hydrologic system. To explore how process interactions affect model performance we have developed a general framework based on information theory and conditional probabilities. We compare how conditional mutual information and mean square errors are related in a variety of hydrometeorological conditions. By exploring different regions of phase space we can quantify model strengths and weaknesses in terms of both process accuracy and classical performance. By considering a range of conditions we can evaluate and compare models outside of their average behavior. We apply this analysis to physically-based models (based on SUMMA), statistical models, and observations from FluxNet towers at a number of hydro-climatically diverse sites. By focusing on how the turbulent heat fluxes are affected by shortwave radiation, air temperature, and relative humidity we go beyond simple error metrics and are able to reason about model behavior in a physically motivated way. We find that the statistically based models, while showing better performance in the mean field, often do not represent the underlying physics as well as the physically based models. The statistically based models’ over-reliance on shortwave radiation inputs limits their ability to reproduce more complex phenomena.
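For readers unfamiliar with conditional mutual information, a minimal plug-in estimate can be built from joint histograms using the identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The sketch below is an assumption-laden illustration, not the estimator used in the study: the equal-width binning, the `bins` default, and the function names are choices made here, and histogram estimators are known to be biased for small samples.

```python
import numpy as np

def entropy(*variables, bins=10):
    """Shannon entropy in bits of the joint histogram of the given 1-D arrays."""
    counts, _ = np.histogramdd(np.column_stack(variables), bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_mutual_information(x, y, z, bins=10):
    """Plug-in estimate of I(X;Y|Z) in bits from equal-width histograms."""
    return (entropy(x, z, bins=bins) + entropy(y, z, bins=bins)
            - entropy(x, y, z, bins=bins) - entropy(z, bins=bins))
```

In the setting described above, `x` could be a forcing such as shortwave radiation, `y` a turbulent heat flux, and `z` a conditioning variable such as air temperature; comparing the value computed from model output with the value computed from observations gives a process-oriented complement to mean square error.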
Planning for hydropower, water resources management, and climate change adaptation requires statistically unbiased hydrologic predictions. However, all hydrologic models contain systematic errors, e.g., incorrect mathematical representations of physical processes and effects of uncertainties in data sources. Statistical post-processing, or bias correction, is often used to reduce the effects of these systematic errors in model outputs. A large number of techniques for performing bias correction have been developed, primarily in the context of correcting statistical properties of independent locations. However, when bias correcting streamflow predictions within the same stream network, this assumption of spatial independence breaks down. Independently bias correcting locations from the headwaters to the mouth of a river system destroys the spatial consistency of the streamflow across a river network. We describe work toward maintaining spatial consistency in streamflow bias correction using a number of locations in the western United States. We simulate the hydrology of the Columbia River in the Pacific Northwestern United States, a river system that spans a number of hydroclimatic and flow regimes and contains a large number of flow gages. We develop a mapping from the modeled output at the gages with flow observations, which we use as the basis for training a machine learning (ML) model to perform the site-specific bias correction. We then apply the ML model to local streamflow contributions for each river segment, including river segments without flow observations. Finally, we combine the local bias corrections across the stream network to create accumulated bias-corrected streamflow time series that are spatially consistent across the stream network. We compare our method against several commonly used bias correction techniques to evaluate both model performance and spatial consistency.
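The final step, combining local corrections across the network, can be pictured as accumulating each segment's corrected local contribution downstream. The sketch below is a deliberately simplified assumption: it sums contributions along a tree-shaped network with no travel time or channel storage, whereas a real application would pass the corrected local flows through a routing model. The function name and the dictionary-based network description are hypothetical.

```python
def accumulate_flows(local_flow, downstream):
    """Sum local contributions down a tree-shaped river network.

    local_flow: dict mapping segment id -> local (incremental) flow
    downstream: dict mapping segment id -> id of the next segment, or None at an outlet
    Returns a dict mapping segment id -> total flow leaving that segment.
    """
    total = dict(local_flow)
    # Count how many segments drain directly into each segment.
    n_upstream = {seg: 0 for seg in local_flow}
    for seg, ds in downstream.items():
        if ds is not None:
            n_upstream[ds] += 1
    # Start at headwaters and move downstream once all tributaries are done.
    ready = [seg for seg, n in n_upstream.items() if n == 0]
    while ready:
        seg = ready.pop()
        ds = downstream[seg]
        if ds is not None:
            total[ds] = total[ds] + total[seg]
            n_upstream[ds] -= 1
            if n_upstream[ds] == 0:
                ready.append(ds)
    return total
```

Because every downstream total is built from the same corrected local contributions, flows at nested locations cannot contradict each other, which is the spatial consistency that independent site-by-site correction loses.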