External validation. For external validation, 757 ILs (15015 data points) from the ρ dataset were used as the training set, and the remaining 215 ILs (4320 data points) served as the testing set. The detailed results are listed in Table 2. The R² values of the training and testing sets reached 0.9922 and 0.9921, respectively, and the corresponding MAE values were 9.3290 kg/m³ and 9.3606 kg/m³. The experimental and calculated ρ values for the training and testing sets are shown in Figure 4c. The data points of both sets follow the same overall trend and lie close to the diagonal, which indicates that the model has good predictive ability for the ρ of ILs.
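The statistics reported for external validation can be illustrated with a minimal sketch. It assumes that the split is made by IL, so that all data points of one IL fall in the same set, and that R² is computed as 1 − SS_res/SS_tot. The IL names, the density values and the helper names (`split_by_il`, `evaluate`) are hypothetical and are not taken from the ρ dataset.

```python
# Toy illustration of an IL-wise external validation split with R^2 and MAE.
# All IL names and density values (kg/m^3) are made up for demonstration.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Mean absolute deviation between experimental and calculated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def split_by_il(records, test_ils):
    """Keep every data point of a given IL in the same subset."""
    train = [r for r in records if r[0] not in test_ils]
    test = [r for r in records if r[0] in test_ils]
    return train, test


def evaluate(subset):
    """Return (R^2, MAE) for a list of (il, experimental, calculated)."""
    y_exp = [r[1] for r in subset]
    y_calc = [r[2] for r in subset]
    return r_squared(y_exp, y_calc), mean_absolute_error(y_exp, y_calc)


# (IL label, experimental density, calculated density); hypothetical values
records = [
    ("IL_A", 1200.0, 1205.0),
    ("IL_A", 1190.0, 1188.0),
    ("IL_B", 1100.0, 1090.0),
    ("IL_B", 1095.0, 1099.0),
    ("IL_C", 1300.0, 1304.0),
    ("IL_C", 1250.0, 1246.0),
]

train, test = split_by_il(records, test_ils={"IL_C"})
r2_train, mae_train = evaluate(train)
r2_test, mae_test = evaluate(test)
print("train:", r2_train, mae_train)
print("test: ", r2_test, mae_test)
```

Splitting by IL rather than by data point means that the testing set contains only ILs the model has not seen during training.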
Y-randomized analysis. To evaluate the reliability of the ρ model, Y-random validation was repeated 1000 times. The two statistics obtained from the Y-random validation were less than 0.00248 and 0.00689, respectively, far below the R² (0.9919) of the ρ model. Therefore, the ρ(T, P, I)-QSPR model was not affected by chance correlation.
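The idea behind Y-randomization can be sketched as follows: the response values are shuffled, the model is refitted, and the resulting fit statistic is compared with that of the original model. The sketch below is only an illustration under stated assumptions. It uses a single-descriptor least-squares line as a stand-in for the actual QSPR equation, and the data, the seed and the function names (`r2_of_linear_fit`, `y_randomization`) are hypothetical.

```python
import random

# Y-randomization sketch with a one-descriptor linear model as a stand-in.
# The descriptor and response values are synthetic, not from the rho dataset.

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_of_linear_fit(x, y):
    """Fit a line to (x, y) and return the R^2 of that fit."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_rounds=1000, seed=0):
    """Refit the model on shuffled responses and collect the R^2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)  # breaks the descriptor-response link
        scores.append(r2_of_linear_fit(x, y_shuffled))
    return scores


# Synthetic data: a clear linear trend plus Gaussian noise
data_rng = random.Random(42)
x = [float(i) for i in range(200)]
y = [2.0 * xi + 50.0 + data_rng.gauss(0.0, 5.0) for xi in x]

r2_original = r2_of_linear_fit(x, y)
random_scores = y_randomization(x, y, n_rounds=1000, seed=0)
mean_random_r2 = sum(random_scores) / len(random_scores)
print("R2 of original model:", r2_original)
print("mean / max R2 after shuffling:", mean_random_r2, max(random_scores))
```

If the shuffled models score far below the original model, as in this toy case, the original fit is unlikely to be the result of chance correlation.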
3.3.2. Model comparison: before and after data pre-screening
To highlight the importance of data pre-screening in the pre-modelling work, LOIO-CV was also carried out on the dataset without screening, using the same descriptors as in Eq. (15). The detailed LOIO-CV results are shown in Table 3. Higher Q² (Q²_LOCO = 0.9919 and Q²_LOAO = 0.9899) and lower MAE (MAE_LOCO = 8.6487 kg/m³ and MAE_LOAO = 10.2462 kg/m³) were obtained when the model was built using the dataset without data pre-screening. However, when Q² was recalculated on the dataset selected by following the data pre-screening rules, Q² decreased (Q²_LOCO = 0.9903 and Q²_LOAO = 0.9884) and MAE increased (MAE_LOCO = 10.2589 kg/m³ and MAE_LOAO = 11.8680 kg/m³). In addition, the results in Table 3 show that the Q² obtained with the pre-screened data (Q²_LOCO = 0.9905 and Q²_LOAO = 0.9894) is higher than these recalculated values (Q²_LOCO = 0.9903 and Q²_LOAO = 0.9884). This shows that the higher Q² obtained without data pre-screening is “pseudo-high” in accuracy. Therefore, it is necessary to carry out a pre-screening process before modelling to ensure a balanced distribution of the dataset.
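The LOIO-CV statistics compared above can be illustrated with a minimal sketch. It assumes that LOCO and LOAO denote leaving out all data points of one cation or one anion at a time, and that Q² is computed as 1 − PRESS/SS_tot with the overall mean of the experimental values. A temperature-only linear fit stands in for Eq. (15), and the ion labels, densities and function names are hypothetical.

```python
# Leave-one-ion-out cross-validation sketch (assumed meaning of LOIO-CV).
# A temperature-only line is a stand-in model; all data are synthetic.

def fit_line(train):
    """Least-squares fit of rho = slope * T + intercept on training records."""
    n = len(train)
    mean_t = sum(r["T"] for r in train) / n
    mean_rho = sum(r["rho"] for r in train) / n
    stt = sum((r["T"] - mean_t) ** 2 for r in train)
    str_ = sum((r["T"] - mean_t) * (r["rho"] - mean_rho) for r in train)
    slope = str_ / stt
    return slope, mean_rho - slope * mean_t


def leave_one_group_out(records, key):
    """Yield (held-out label, train, test) for each distinct value of key."""
    for label in sorted({r[key] for r in records}):
        train = [r for r in records if r[key] != label]
        test = [r for r in records if r[key] == label]
        yield label, train, test


def q2_and_mae(records, key):
    """Pool held-out predictions over all folds, then compute Q^2 and MAE."""
    y_exp, y_pred = [], []
    for _, train, test in leave_one_group_out(records, key):
        slope, intercept = fit_line(train)
        for r in test:
            y_exp.append(r["rho"])
            y_pred.append(slope * r["T"] + intercept)
    mean_y = sum(y_exp) / len(y_exp)
    press = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_y) ** 2 for e in y_exp)
    mae = sum(abs(e - p) for e, p in zip(y_exp, y_pred)) / len(y_exp)
    return 1.0 - press / ss_tot, mae


# Synthetic densities: linear in T plus fixed cation and anion offsets
cation_offset = {"C1": 0.0, "C2": 5.0, "C3": -5.0}
anion_offset = {"A1": 0.0, "A2": 3.0}
records = [
    {"cation": c, "anion": a, "T": t,
     "rho": 1400.0 - 0.7 * t + cation_offset[c] + anion_offset[a]}
    for c in cation_offset for a in anion_offset
    for t in (293.15, 313.15, 333.15)
]

q2_loco, mae_loco = q2_and_mae(records, key="cation")
q2_loao, mae_loao = q2_and_mae(records, key="anion")
print("LOCO:", q2_loco, mae_loco)
print("LOAO:", q2_loao, mae_loao)
```

Running the same routine on a screened and an unscreened set of records gives the kind of side-by-side comparison summarized in Table 3.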
Table 3. Comparison of model stability before and after data pre-screening for density.