External validation. For external validation, 757 ILs
(15015 data points) from the ρ dataset were used as the training set,
and the remaining 215 ILs (4320 data points) served as the testing set.
The detailed results for external validation are listed in Table 2. The R² values of the training and testing sets reached 0.9922
and 0.9921, respectively. The MAE values were 9.3290
kg/m³ and 9.3606 kg/m³,
respectively. The experimental and calculated ρ values for
the training and testing sets are shown in Figure 4c. The data points of
both sets follow the same overall trend and lie close to the diagonal,
which indicates that the model has good predictive ability for the ρ of ILs.
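The two statistics used above can be illustrated with a minimal sketch. It assumes the usual definitions, R² = 1 − SS_res/SS_tot and MAE as the mean absolute deviation, and it assumes that the split is made by IL so that all data points of one IL fall on the same side. The helper names (`r2_score`, `mae`, `split_by_il`), the synthetic data and the ordinary least-squares fit are hypothetical stand-ins and are not the model of Eq. (15).

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error, in the units of y (kg/m^3 for density)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def split_by_il(il_ids, test_fraction, seed=0):
    """Boolean test mask; every data point of an IL lands on one side only."""
    il_ids = np.asarray(il_ids)
    unique_ils = np.unique(il_ids)
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * unique_ils.size))
    test_ils = rng.choice(unique_ils, size=n_test, replace=False)
    return np.isin(il_ids, test_ils)

# Synthetic illustration only: 50 hypothetical ILs with 20 points each.
rng = np.random.default_rng(42)
il_ids = np.repeat(np.arange(50), 20)
X = rng.normal(size=(il_ids.size, 3))
y = 1200.0 + X @ np.array([30.0, -15.0, 5.0]) + rng.normal(scale=5.0, size=il_ids.size)

test_mask = split_by_il(il_ids, test_fraction=0.2)

# Ordinary least squares as a stand-in model, fitted on the training ILs only.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A[~test_mask], y[~test_mask], rcond=None)
y_hat = A @ coef

r2_train = r2_score(y[~test_mask], y_hat[~test_mask])
r2_test = r2_score(y[test_mask], y_hat[test_mask])
print("train:", r2_train, mae(y[~test_mask], y_hat[~test_mask]))
print("test: ", r2_test, mae(y[test_mask], y_hat[test_mask]))
```

Similar R² and MAE values on the two sets, as reported in Table 2, are what one would expect from a model that generalizes to ILs it has not seen.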
Y-randomized analysis. To evaluate the reliability of
the ρ model, Y-random validation was repeated 1000 times.
The two statistics obtained from the Y-random validation were less
than 0.00248 and 0.00689, respectively, far lower than the R² (0.9919) of the ρ model. Therefore,
the ρ(T, P, I)-QSPR model was not affected by
chance correlation.
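The idea behind the Y-random validation can be sketched as follows: the response values are shuffled, the model is refitted on the shuffled response, and the resulting R² is recorded. If every shuffled fit gives an R² close to zero, the R² of the real model is unlikely to arise by chance. The sketch below is an illustration under stated assumptions: it uses ordinary least squares on synthetic data as a stand-in for the QSPR model, and the function names `fit_r2` and `y_randomization` are hypothetical.

```python
import numpy as np

def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return the fitted R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_repeats=1000, seed=0):
    """R^2 of models refitted on randomly permuted responses."""
    rng = np.random.default_rng(seed)
    return np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_repeats)])

# Synthetic illustration only: 500 points, 3 hypothetical descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 1200.0 + X @ np.array([30.0, -15.0, 5.0]) + rng.normal(scale=5.0, size=500)

r2_true = fit_r2(X, y)
r2_rand = y_randomization(X, y, n_repeats=1000)
print("R2 of the real model:", r2_true)
print("largest R2 over the shuffled models:", r2_rand.max())
```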
3.3.2. Model comparison: before and after data pre-screening
To highlight the importance of data pre-screening in
the pre-modelling work, LOIO-CV was also carried out on the data
without screening, using the same descriptors as in Eq. (15). The
detailed LOIO-CV results are shown in Table 3. Higher Q² (Q²_LOCO = 0.9919 and Q²_LOAO = 0.9899) and lower MAE
(MAE_LOCO = 8.6487 kg/m³ and
MAE_LOAO = 10.2462 kg/m³) were obtained
when the model was built using the dataset without data pre-screening.
However, when Q² was recalculated on the
dataset selected by following the data pre-screening rules, there was a
decrease in Q² (Q²_LOCO = 0.9903 and Q²_LOAO = 0.9884) and an
increase in MAE (MAE_LOCO = 10.2589
kg/m³ and MAE_LOAO = 11.8680
kg/m³). In addition, the results in Table 3 show that
the Q² (Q²_LOCO = 0.9905 and Q²_LOAO = 0.9894) of
pre-screened data is higher than the Q² (Q²_LOCO = 0.9903 and Q²_LOAO = 0.9884) of post-screened
data. This indicates that the model, although it has a higher Q² without data pre-screening, is
“pseudo-high” in accuracy. Therefore, it is necessary to carry out a
pre-screening process before modelling to ensure a balanced distribution
of the dataset.
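The cross-validated statistics compared above can be illustrated with a short sketch. It assumes that LOIO-CV denotes leave-one-ion-out cross-validation, with LOCO and LOAO holding out all data of one cation or one anion at a time, and that Q² = 1 − PRESS/SS_tot with SS_tot taken about the overall mean of the response. The function `leave_one_group_out`, the ordinary least-squares model and the synthetic ion labels are hypothetical and serve only as an illustration.

```python
import numpy as np

def leave_one_group_out(X, y, groups):
    """Return (Q^2, MAE) from predictions made with each group held out in turn."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    A = np.column_stack([np.ones(len(X)), X])
    y_pred = np.empty_like(y)
    for g in np.unique(groups):
        held_out = groups == g
        # Refit on every point that does not contain the held-out ion.
        coef, *_ = np.linalg.lstsq(A[~held_out], y[~held_out], rcond=None)
        y_pred[held_out] = A[held_out] @ coef
    press = np.sum((y - y_pred) ** 2)
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    return float(q2), float(np.mean(np.abs(y - y_pred)))

# Synthetic illustration only: 400 points, 8 hypothetical cations, 6 anions.
rng = np.random.default_rng(7)
cations = rng.integers(0, 8, size=400)
anions = rng.integers(0, 6, size=400)
X = rng.normal(size=(400, 3))
y = 1200.0 + X @ np.array([30.0, -15.0, 5.0]) + rng.normal(scale=5.0, size=400)

q2_loco, mae_loco = leave_one_group_out(X, y, cations)
q2_loao, mae_loao = leave_one_group_out(X, y, anions)
print("LOCO:", q2_loco, mae_loco)
print("LOAO:", q2_loao, mae_loao)
```

Because every prediction comes from a model that never saw the held-out ion, Q² is sensitive to how the data points are distributed among ions, which is why the screened and unscreened datasets can give different values for the same descriptors.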
Table 3. Comparison of model stability before and after data
pre-screening for density.