Figure 13 Comparison of predicted and observed Q90
[Insert Figure 14]
Figure 14 Comparison of predicted and observed Q99.7
[Insert Figure 15]
Figure 15 Performance of 8 ML models on the training sets
[Insert Figure 16]
Figure 16 Performance of 8 ML models on the testing sets
This paper compares the predicted and observed values of the 15 streamflow percentiles that define the FDCs obtained from the eight models; the predicted and observed values are consistent. Figures 10 to 14 focus on the prediction results for five key streamflow percentiles (including Q90 and Q99.7, shown in Figure 13 and Figure 14).
Overall, the ratio of predicted to observed values is stable around 1, and R2 is close to 1. Neural networks generalize better than the other machine learning algorithms, as shown by their higher predictive accuracy on the testing set. Every model predicts the upper tail of the FDC more accurately than the lower tail, with Q5 having the highest predictive accuracy and Q99.7 the lowest. The differences between observed and predicted values may be attributed to the random-like nature of hydrological phenomena, a conclusion also reached in related literature (Montanari and Koutsoyiannis, 2012). Among the eight ML models, ELM and RBF perform worse on the testing set, probably because of the simplicity of single-layer neural networks. The XGB, PSO-BP and GWO-BP models predict significantly better: all three perform well on both the training and testing sets, with R2 greater than 0.8 on both sets at the different streamflow percentiles.
Predictive performance should not be evaluated with a single metric. Unlike scatter plots, which display only the relationship between individual indicators (Choubin et al., 2018), the Taylor diagram combines three evaluation metrics: the correlation coefficient, the centered root-mean-square difference, and the standard deviation. It uses the cosine relationship among the three to evaluate predictive performance from different perspectives (Figure 15 and Figure 16). The closer a model’s point lies to the “observed” point on the x-axis, the better the model’s predictions agree with the observations.
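The cosine relationship behind the Taylor diagram can be checked numerically. The following is a minimal sketch, not the plotting code used for Figures 15 and 16: the function name taylor_statistics and the five streamflow values are hypothetical and serve only as an illustration. It computes the three statistics with population (1/n) normalization and compares both sides of the law-of-cosines identity.

```python
import math


def taylor_statistics(observed, predicted):
    """Return the statistics summarized by a Taylor diagram."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Anomalies (mean removed); "centered" refers to this step.
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    std_p = math.sqrt(sum(d * d for d in dev_p) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)
    # Centered root-mean-square difference between the two anomaly series.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return {"std_obs": std_o, "std_pred": std_p, "corr": corr, "crmsd": crmsd}


# Hypothetical streamflow values, for illustration only.
observed = [12.0, 8.5, 5.1, 3.2, 1.4]
predicted = [11.2, 9.0, 4.6, 3.9, 1.1]
stats = taylor_statistics(observed, predicted)

# Law of cosines: crmsd^2 = std_obs^2 + std_pred^2 - 2*std_obs*std_pred*corr
lhs = stats["crmsd"] ** 2
rhs = (stats["std_obs"] ** 2 + stats["std_pred"] ** 2
       - 2 * stats["std_obs"] * stats["std_pred"] * stats["corr"])
print(stats, abs(lhs - rhs))
```

Because the three statistics are tied together by this identity, a single point on the diagram encodes all of them: its radial distance is the standard deviation, its angle gives the correlation, and its distance to the “observed” point is the centered root-mean-square difference.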
[Insert Figure 17]
Figure 17 Comparison of predicted values obtained by PSO-BP and GWO-BP with observed data (testing sets)
For the training sets, the RBF and ELM models perform poorly, while XGB performs best. For the high tails, the eight models rank as XGB > SVM > RF > PSO-BP > BPNN > GWO-BP > RBF > ELM; for the low tails, they rank as XGB > RF > BPNN > GWO-BP > SVM > PSO-BP > RBF > ELM.
For the testing sets, the prediction performance of BPNN is not as good as that of the other seven models, while XGB, PSO-BP, and GWO-BP all perform well. For the high tails, the models rank as PSO-BP > GWO-BP > XGB > RF > SVM > BPNN > ELM > RBF; for the low tails, they rank as GWO-BP > XGB > PSO-BP > RF > BPNN > ELM > SVM > RBF.
Notably, ML predicts the lower tail of the FDC significantly less accurately than the upper tail, although GWO-BP and XGB predict the lower tail well. Comparing the evaluation indicators identifies GWO-BP and XGB as the best models for predicting the FDC. Moreover, a comparison of BPNN, PSO-BP, and GWO-BP (Figure 17) shows that optimizing ML model parameters with swarm intelligence optimization algorithms can significantly enhance a model’s predictive capability and generalization ability.
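The idea of tuning model parameters with a swarm can be illustrated with a minimal particle swarm optimization (PSO) sketch. This is not the PSO-BP implementation used in this paper: the BP network is replaced by a hypothetical two-parameter linear model y = a*x + b, and the data, swarm size, inertia weight w, and acceleration coefficients c1 and c2 are illustrative assumptions.

```python
import random


def mse(params, xs, ys):
    """Mean squared error of the toy model y = a*x + b."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pso(loss, dim, n_particles=30, n_iter=300, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO; returns (best_position, best_loss, best_loss_history)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_loss = [loss(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_loss[i])
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]   # best position of the swarm
    history = [gbest_loss]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            current = loss(pos[i])
            if current < pbest_loss[i]:
                pbest[i], pbest_loss[i] = pos[i][:], current
                if current < gbest_loss:
                    gbest, gbest_loss = pos[i][:], current
        history.append(gbest_loss)
    return gbest, gbest_loss, history


# Hypothetical data generated from y = 2x + 1, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
best_params, best_loss, history = pso(lambda p: mse(p, xs, ys), dim=2)
print(best_params, best_loss)
```

In a PSO-BP or GWO-BP setting, the position vector would hold the network’s weights and biases and the loss would be the network’s training error; the swarm search shown here only conveys the general mechanism.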

4.3 Prediction results throughout the entire duration

[Insert Figure 18]
Figure 18 Overall evaluation (R2) (testing sets) of the estimated quality of ML models (15 streamflow percentiles)
Streamflow percentiles at multiple points reflect the shape of the FDC. R2 and NSE are commonly used to assess model predictions. We consider an R2 greater than 0.85 to indicate good predictive performance. In addition to R2, std, cor, and RMSE, which were analyzed for the six most typical streamflow percentiles in the previous section, NSE is used to group the models into four categories: NSE values less than or equal to 0.50, 0.50~0.65, 0.65~0.75, and greater than 0.75 represent bad, satisfactory, good, and excellent performance, respectively (Fatehi et al., 2015).
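The NSE criterion can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the evaluation code of this paper: the function names and the sample values are hypothetical, and each class is assumed to include its upper boundary (for example, an NSE of exactly 0.65 is treated as satisfactory), which the thresholds above do not specify.

```python
def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations from the observed mean."""
    mean_o = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - sse / sst


def classify_nse(value):
    """Map an NSE value to the four performance classes (boundary handling is an assumption)."""
    if value <= 0.50:
        return "bad"
    if value <= 0.65:
        return "satisfactory"
    if value <= 0.75:
        return "good"
    return "excellent"


# Hypothetical observed and predicted values of one streamflow percentile.
observed = [12.0, 8.5, 5.1, 3.2, 1.4]
predicted = [11.2, 9.0, 4.6, 3.9, 1.1]
score = nse(observed, predicted)
print(score, classify_nse(score))
```

An NSE of 1 means a perfect match, while an NSE of 0 means the model is no better than always predicting the observed mean, which is why values at or below 0.50 fall into the lowest class.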
[Insert Figure 19]
Figure 19 Overall evaluation (NSE) (testing set) of the estimated quality of ML models (15 streamflow percentiles)
As shown in Figure 18 and Figure 19, under the R2 and NSE criteria, RF, PSO-BP, and XGB all achieve very good performance, except at Q99.7, where the results are only satisfactory. The XGB model has less predictive power for the larger and smaller streamflow percentiles than for the middle ones, where its performance is particularly good. The GWO-BP model performs well across the entire duration range (i.e., from high flow to low flow) in the testing set, with R2 of 0.86 to 0.94 and NSE of 0.78 to 0.94. The RF, PSO-BP, and XGB models also perform well throughout the entire duration, but less well than the GWO-BP model. Across the full range of the FDC, GWO-BP is therefore the best model for predicting the FDC among all the models in this paper.
Vafakhah and Khosrobeigi Bozchaloei (2020) identified SVR as the optimal model for predicting FDC, with relative RMSE of 9.37 to 1.45 and NSE of 0.54 to 0.91. Compared with that study, the GWO-BP model selected in this paper greatly improves the accuracy of prediction.

4.4 Feature importance analysis of the processes

[Insert Figure 20]
Figure 20 Interpretation of the predicted FDC and feature importance analysis
Because of the “black box” issue, ML models have their limitations (Esterhuizen et al., 2022). “Feature importance” merely indicates which features matter more; how each feature influences the prediction results remains unknown. In this paper, SHAP (SHapley Additive exPlanations) values were used to explain the results of machine learning. The advantage of SHAP values is that they reveal not only the impact of each feature in a given sample but also the sign of that impact (i.e., whether it is positive or negative) (Dikshit and Pradhan, 2021).
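How a Shapley value carries both magnitude and sign can be shown with a brute-force calculation on a toy model. This sketch is an assumption-laden illustration, not the SHAP workflow used in this paper: the model, its coefficients, the three feature names in the comments, and the single baseline sample are all hypothetical, and features outside a coalition are simply set to their baseline values. SHAP implementations approximate the same quantity far more efficiently for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the sample's values; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical flow model: z[0] ~ baseflow index, z[1] ~ dry-day fraction,
    # z[2] ~ precipitation; the coefficients are invented for illustration.
    return 2.0 + 3.0 * z[0] - 4.0 * z[1] + 1.0 * z[0] * z[2]


x = [0.8, 0.6, 1.5]          # sample to explain
baseline = [0.5, 0.3, 1.0]   # reference sample (e.g., feature means)
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for the sample and the prediction for the baseline. In this toy case the first feature receives a positive value and the second a negative one, which is the kind of signed information that a plain feature-importance ranking does not provide.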
Figure 20 shows the impact of the 22 basin-characteristic variables on six critical streamflow percentiles. SHAP values are calculated for each sample and each variable; taken together, they show the global impact of each feature on the model and quantify the contribution of the 22 environmental variables to the different streamflow percentiles. Each row represents an environmental input variable, and the horizontal axis shows the distribution of SHAP values. Each point represents a sample, with color indicating the feature value (red for high values and blue for low values).
Comparing Figure 20 (a) and Section 3.2, the correlation coefficients between BFI_mean, Smax, ATP, DP_std, AP_max, , SWS and are relatively high, which are slightly different from the neural network prediction results but generally consistent. It is worth paying attention to that the zero value represents the average value of on the horizontal axis. Considering , there is a rise in the value of SHAP when decreases (changes in the color from red to blue). When reaches its maximum, the is 1 lower than its average value, while reaches its minimum, the is 2 higher than its average value. This is because a high probability of no precipitation days indicates a decrease in precipitation frequency (Cheng et al., 2012) , which will significantly lower the value of flow rate quantile.
It also can be seen that the impact of environmental variables on different streamflow quantiles varies noticeably. However, the two main influencing factors for streamflow quantiles remain nearly unchanged, with and BFI_mean playing the key roles. High values of the will reduce the flow rate of streamflow quantiles, exerting a negative impact, while high values of the BFI_mean feature will increase the flow rate of streamflow quantiles, exerting a positive impact. with higher SHAP values results in lower streamflow percentiles, exerting a negative impact to the output values, while BFI_mean with higher SHAP values results in higher streamflow percentiles, having positive impacts on the output values. Additionally, it can be observed that contributes more to the prediction of high flow rate values such as ,, which means it will predict the upper tail of FDC more accurately, while BFI_mean has a greater impact on the prediction of low flow rate values such as , which means it will predict the lower tail of FDC more accurately.
The main influencing factors obtained through SHAP in this paper are consistent with the physical controls of the gamma distribution fitting parameters in the same region, which supports the accuracy of the model in this paper (Yu Zhou, 2023). The influence of annual average precipitation and maximum precipitation on the predicted flow quantiles is not significant, while the prediction results of flow quantiles are closely related to , indicating that high flow may be driven by short-term precipitation events, which are closely related to the frequency of precipitation occurrence and cannot be captured by annual average precipitation. The frequency of precipitation has a significant impact on the prediction results of FDC, mainly because it directly affects the runoff generation mechanism and water balance of the watershed (Butcher et al., 2021). High precipitation frequency means more precipitation in a shorter period of time, which leads to faster collection of surface runoff, higher soil saturation, and lower soil infiltration capacity (Crow et al., 2018). When soil moisture content is high, evaporation decreases accordingly, reducing water consumption. Thus, the negative impact of precipitation frequency on the streamflow at given percentiles may be mainly because frequent precipitation leads to faster collection of surface runoff, resulting in slower increases or faster decreases in streamflow.
The BFI is largely influenced by the water storage capacity of the aquifer and by human activities, which indicates the importance of the aquifer’s water storage capacity in predicting low flows (Mazvimavi et al., 2004). Basins with low BFI cannot maintain good water flow mobility, which may result in a shorter duration of flow in high flow areas, while basins with high BFI can better maintain water flow mobility and thus maintain high flow conditions for a longer period of time. This can explain why BFI_mean exerts a positive impact on the output values and has a greater impact on the prediction of low flow rate values.
Like other statistics-based methods (Burgan and Aksoy, 2022c), this paper is subject to subjectivity and uncertainty in variable selection (Veber Costa, 2020). Future research should explore more watershed characteristics that influence FDCs and incorporate them into the model for more precise prediction. Larger datasets and scales (e.g., the global scale) should also be examined to enhance the applicability of the model before it is applied to watersheds with more diverse climate and landscape conditions.
Most data-driven models, such as neural networks, use only the correlation between inputs and outputs, and the mechanism by which the influencing factors act is unknown (Atieh et al., 2017c; Bozchaloei and Vafakhah, 2015). Scholars have pointed out that machine learning (ML) can help hydrology make progress in many ways, including (1) incorporating physics into ML models and (2) improving the explanatory ability of ML models (Shen, 2018). From these two perspectives, the findings of this paper provide new methods and insights for more accurate data-driven FDC prediction and analysis, which will help provide a scientific basis for water resource management and hydrological forecasting and reveal the underlying physical processes.

5 CONCLUSION

This paper applied different ML methods to estimate FDCs. Based on a total of 645 sets of samples described by 22 basin characteristic variables (both “mutable” and “immutable”), eight ML models were used to predict the FDC (the flow quantiles corresponding to 15 exceedance probabilities). In addition, SHAP analysis was used to identify the main input variables that affect the prediction of the different streamflow quantiles and the degree of that influence. The optimal model for prediction under these environmental conditions was identified. The main conclusions are as follows:
  1. With high prediction accuracy and good generalization ability, GWO-BP and XGB are the best models for predicting FDC. Moreover, optimizing the parameters of the original BPNN with swarm intelligence optimization algorithms can significantly enhance its predictive capability and generalization ability.
  2. Across the full range of the FDC, GWO-BP is the best predictive model among the eight, with R2 of 0.86 to 0.94 and NSE of 0.78 to 0.94 in the testing set. This significantly improves on the prediction accuracy of existing research, which identified SVR as the optimal model for predicting FDC, with RMSE of 9.37 to 1.45 and NSE of 0.54 to 0.91 (Vafakhah and Khosrobeigi Bozchaloei, 2020).
  3. The predictive impact of the variables varies across quantiles, with and BFI_mean contributing most significantly to predicting FDC. The has a negative effect on the prediction result and contributes more to predicting higher flow rates, mainly because frequent precipitation can lead to faster collection of surface runoff, resulting in slower increases or faster decreases in streamflow. Basins with low BFI cannot maintain good water flow mobility, which may result in a shorter duration of flow in high flow areas, while basins with high BFI can better maintain water flow mobility and thus maintain high flow conditions for a longer period of time. Therefore, BFI_mean exerts a positive impact on the output values and has a greater impact on the prediction of low flow rate values.