Calibration.
The agreement between observed outcomes and predictions made by the model is referred to as calibration1. Model calibration measures the validity of the predictions and determines whether the predictions of the risk prediction model align with what is observed within the study cohort. For example, if we predict a 20% risk that a person will develop hypertension, the observed frequency of hypertension should be 20 out of 100 people with such a prediction. A calibration plot allows a visual inspection of calibration by plotting predicted against observed probabilities: predictions are plotted on the x-axis and the observed outcome on the y-axis. For binary outcomes the y-axis contains only 0 and 1 values, so smoothing techniques (e.g., the loess algorithm) are employed to estimate the observed probabilities of the outcome across the range of predicted probabilities. Perfect predictions lie on the 45° line, indicating that the predicted risks are correct. An alternative assessment of calibration is to categorize predicted risk into groups (e.g., deciles) and assess whether the event rate in each group corresponds to the average predicted risk in that group. The Hosmer-Lemeshow goodness-of-fit test complements the graphical assessment with a formal statistical test of whether the observed event rates match the expected event rates in subgroups of the model population.
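The decile-based assessment and the Hosmer-Lemeshow statistic described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the grouping into deciles, the simulated data, and the variable names are assumptions for the example, and the resulting statistic would be compared against a chi-square distribution with (number of groups − 2) degrees of freedom.

```python
import numpy as np

def decile_calibration(y_true, y_prob, n_groups=10):
    """Group predictions into risk deciles and return, per group,
    the mean predicted risk, the observed event rate, and the group size."""
    order = np.argsort(y_prob)
    groups = np.array_split(order, n_groups)
    return [(y_prob[idx].mean(), y_true[idx].mean(), len(idx))
            for idx in groups]

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic: sum over risk groups of
    (observed - expected)^2 / (n * p_bar * (1 - p_bar)).
    Compare against chi-square with n_groups - 2 degrees of freedom."""
    stat = 0.0
    for p_bar, o_bar, n in decile_calibration(y_true, y_prob, n_groups):
        expected = n * p_bar
        observed = n * o_bar
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return stat

# Hypothetical, perfectly calibrated data: outcomes are drawn from
# the predicted risks, so observed rates should track predictions.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, 2000)
y_true = rng.binomial(1, y_prob)
hl_stat = hosmer_lemeshow(y_true, y_prob)
```

With well-calibrated predictions the per-decile observed rates should lie close to the mean predicted risks (the 45° line of the calibration plot), and the Hosmer-Lemeshow statistic should be unremarkable relative to its chi-square reference distribution.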
For survival data, calibration is usually assessed at fixed time points2. At each time point, the observed survival rate is calculated with the Kaplan-Meier method for a group of patients and compared with the mean predicted survival from the prediction model2.
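The fixed-time-point comparison can be illustrated with a small self-contained Kaplan-Meier estimator. The data below are simulated for the example (exponential event times with uniform censoring), and the "predicted" survival is taken from the same exponential model; in practice the prediction would come from the fitted risk model.

```python
import numpy as np

def kaplan_meier_at(times, events, t):
    """Kaplan-Meier estimate of the survival probability at time t.
    times: follow-up times; events: 1 = event observed, 0 = censored."""
    s = 1.0
    for u in np.sort(np.unique(times[events == 1])):
        if u > t:
            break
        at_risk = np.sum(times >= u)          # subjects still in follow-up
        d = np.sum((times == u) & (events == 1))  # events at this time
        s *= 1.0 - d / at_risk
    return s

# Hypothetical cohort: exponential event times (mean 5 years),
# uniform censoring over 10 years.
rng = np.random.default_rng(1)
event_time = rng.exponential(5.0, 500)
censor_time = rng.uniform(0.0, 10.0, 500)
times = np.minimum(event_time, censor_time)
events = (event_time <= censor_time).astype(int)

# Observed 3-year survival vs. the model's mean predicted 3-year survival.
observed = kaplan_meier_at(times, events, t=3.0)
predicted = np.exp(-3.0 / 5.0)  # exponential model: S(t) = exp(-t/mean)
```

For a calibrated model the observed Kaplan-Meier survival and the mean predicted survival should agree closely at each chosen time point, here 3 years.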
Besides the above-mentioned major measures of model assessment, there are other measures occasionally used to assess a model. Although calibration and discrimination are considered the most important aspects of model assessment, they do not provide any assessment of the clinical usefulness of a model. Assessing clinical usefulness helps to understand the ability of a model to improve decision making compared with a situation in which the model is not used. The measures associated with clinical usefulness are generally tied to a cutoff, a decision threshold of the model, which classifies people into low- and high-risk groups by balancing the likelihood of benefit against the likelihood of harm. Net benefit (NB) is one such measure that can be used to assess the clinical usefulness of a model.
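Net benefit at a threshold pt is commonly computed as TP/n − (FP/n) × pt/(1 − pt), weighting false positives by the odds of the threshold; this is the decision-curve formulation of Vickers and Elkin. The sketch below is a minimal illustration, with the "treat all" strategy included as the usual comparator; the data in the test are hypothetical.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is at or
    above `threshold`: TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(y_true)
    treat = y_prob >= threshold
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'treat all' strategy at the same threshold."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

A model is clinically useful at a given threshold when its net benefit exceeds both the "treat all" and "treat none" (net benefit 0) strategies; plotting net benefit across a range of thresholds yields a decision curve.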