Analysis
We calculated the diagnostic performance of each observer in predicting the level of surgical complexity from each stage, i.e. AAGL stage 1 to predict level A, AAGL stage 2 to predict level B, AAGL stage 3 to predict level C and AAGL stage 4 to predict level D. Data were analysed to determine the kappa and weighted kappa scores, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio and negative likelihood ratio, with 95% confidence intervals. The AAGL system uses a cumulative point score schema, and stage is determined by score thresholds. The paper by Abrão et al. describes the use of logistic regression to determine the point score thresholds defining stages 1–4 that would most accurately predict skill levels A–D (2). Stage 1 was determined to be 0 to 8 points, stage 2 was 9 to 15 points, stage 3 was 16 to 21 points and stage 4 was above 21 points. We tested our dataset in the same manner: the area under the receiver operating characteristic curve (AUROC) was used to determine the overall performance of A vs B/C/D (for a threshold of 8), A/B vs C/D (for a threshold of 15) and A/B/C vs D (for a threshold of 21), for each observer.
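The calculations described above can be illustrated with a minimal Python sketch. It is not the software used in the study. The function names (stage_from_score, diagnostic_metrics, auroc) are hypothetical, the toy data are invented, and two readings are assumed: that each stage-to-level comparison is a one-vs-rest 2x2 table, and that the AUROC is computed from the raw point score against the dichotomised surgical level. Confidence intervals and kappa statistics are omitted.

```python
from bisect import bisect_left

# Inclusive upper bounds of stages 1-3 as quoted in the text; higher scores are stage 4.
STAGE_UPPER_BOUNDS = [8, 15, 21]


def stage_from_score(score):
    """Map a cumulative AAGL point score to stage 1-4."""
    return bisect_left(STAGE_UPPER_BOUNDS, score) + 1


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(stages, levels, stage, level):
    """One-vs-rest 2x2 metrics for `stage` as a predictor of `level` (assumed reading)."""
    tp = fp = fn = tn = 0
    for s, lvl in zip(stages, levels):
        predicted, actual = (s == stage), (lvl == level)
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "lr_pos": _ratio(sens, 1 - spec),
        "lr_neg": _ratio(1 - sens, spec),
    }


def auroc(scores, is_positive):
    """AUROC as the share of positive/negative pairs in which the positive
    case has the higher score; tied pairs count as half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Invented toy data for illustration only, not study data.
scores = [3, 8, 10, 15, 18, 21, 25, 30]
levels = ["A", "B", "A", "B", "C", "D", "C", "D"]
stages = [stage_from_score(s) for s in scores]
stage1_vs_A = diagnostic_metrics(stages, levels, stage=1, level="A")
auc_A_vs_BCD = auroc(scores, [lvl > "A" for lvl in levels])
```

Dichotomising the levels with a simple letter comparison (for example lvl > "B" for A/B vs C/D) keeps the three AUROC analyses to a single function call each.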
Continuous data were summarised by mean and standard deviation, median and interquartile range (25th to 75th percentile), and minimum to maximum. Categorical data were summarised by counts and proportions expressed as percentages. Ordinal data were described by cross-tabulation and summarised as described for continuous data.
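As an illustration of these summaries, the following Python sketch uses only the standard library. The function names are hypothetical, and the "inclusive" quantile method is an assumption: the text does not state which percentile definition was used, and other definitions can give slightly different quartiles in small samples.

```python
import statistics
from collections import Counter


def summarise_continuous(values):
    """Mean, SD, median, interquartile range and range of a numeric sample."""
    # Quartile definition is an assumption; the source does not specify one.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": median,
        "iqr": (q1, q3),
        "range": (min(values), max(values)),
    }


def summarise_categorical(values):
    """Count and percentage for each category."""
    total = len(values)
    return {k: (n, 100 * n / total) for k, n in Counter(values).items()}


def cross_tabulate(rows, cols):
    """Counts for each (row, column) pair, e.g. stage by surgical level."""
    return Counter(zip(rows, cols))
```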