Interpretation
The AAGL system correctly predicted the surgical complexity level in 66.2% of cases, which is comparable to the 69.2% reported in the original paper (2). In our study, however, the overall agreement between AAGL stage and AAGL complexity level was weak, with weighted kappa values of 0.38 to 0.42 across the three observers. This is lower than the 0.621 reported in the original study (2), which suggested moderate agreement. Stage 1 performed reasonably well at predicting skill level A, and this was consistent across the three observers; however, the remaining stages (2, 3 and 4) did not correlate well with complexity level.
The pre-specified AAGL cut-points had reasonably high specificity for discerning skill level A/B/C versus D (stage 4) but low specificity for A versus B/C/D and A/B versus C/D (lower levels). When the AUCROC values in this external validation are compared directly with those reported in the original paper (2), the results are generally less robust. For A versus B/C/D, the AUCROC was 0.98 in the original paper and lower in our analysis, at 0.75 to 0.89. For A/B versus C/D, it was 0.95 in the original paper and lower in our analysis, at 0.81. Only for A/B/C versus D was it higher in our analysis, at 0.95 to 0.96, compared with 0.91 in the original paper. This may reflect the fact that, in the original paper, regression analysis was used to identify the optimal cut-points for that particular dataset, so performance would be expected to be poorer on external validation. The poor diagnostic accuracy for stages 2, 3 and 4 and the generally lower than previously reported AUCROC results in our dataset suggest that the AAGL staging tool may not be generalizable in its current form.
While stage 4 had a low PPV for predicting surgical complexity level D (47.5%), the specificity (91.7%) and NPV (99.57%) were high. This demonstrates that the stage 4 cut-point performs well at ruling out level D: cases assigned a lower stage are very unlikely to be of surgical complexity level D. The AUCROC for stage 4 to discriminate level D from levels A/B/C was high at 0.95, which is consistent with this finding. These results suggest that the tool might be useful for surgical planning, although its utility is limited if the stage can only be determined intraoperatively.
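To illustrate how a low PPV can coexist with high specificity and a high negative predictive value when the target level is uncommon, the following minimal Python sketch computes these measures from a 2x2 table. The counts are hypothetical and are not taken from this study, and the function name `diagnostic_metrics` is likewise only illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return standard 2x2 diagnostic accuracy measures as fractions.

    tp: test positive (e.g. stage 4), condition present (e.g. level D)
    fp: test positive, condition absent
    fn: test negative, condition present
    tn: test negative, condition absent
    """
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# HYPOTHETICAL counts chosen only to illustrate a low-prevalence setting;
# they are NOT the data from this study.
example = diagnostic_metrics(tp=8, fp=12, fn=2, tn=178)

for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical table the condition is present in only 5% of cases, so even a modest number of false positives outweighs the true positives and lowers the PPV, while the large number of true negatives keeps both specificity and NPV high.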