Statistical Analysis
The diagnostic performance of the AI model and the human doctors was
assessed using multiple metrics, including accuracy, sensitivity,
specificity, and the area under the receiver operating characteristic
(ROC) curve (AUC). These parameters were defined as follows:
Accuracy = the number of correctly labeled images divided by the total
number of test images;
Type-specific sensitivity = the number of images correctly labeled with
a given type of abnormality divided by the total number of images with
that type of abnormality;
Overall sensitivity = the total number of images correctly labeled with
the corresponding type of abnormality divided by the total number of
images with any type of abnormality;
Type-specific specificity = the number of images correctly labeled as
without a given type of abnormality divided by the total number of
images without that type of abnormality;
Overall specificity = the total number of images correctly labeled as
without the corresponding type of abnormality divided by the total
number of images without any type of abnormality.
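As a concrete illustration of these definitions, the sketch below computes them in Python from binary indicator matrices. The variable names (y_true, y_pred), the multi-label encoding, and the reading of "correctly labeled image" as a fully correct label vector are our assumptions for illustration; the study's own calculations were performed in Stata.

    import numpy as np

    # Hypothetical inputs: y_true and y_pred are (n_images, n_types) binary
    # matrices; entry [i, j] = 1 if image i has / is labeled with abnormality j.
    y_true = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])
    y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

    # Accuracy: correctly labeled images over all test images (here read as
    # "the image's full label vector is correct").
    accuracy = np.mean((y_pred == y_true).all(axis=1))

    # Type-specific sensitivity for abnormality type j: correctly labeled
    # positives over all images with that type of abnormality.
    def type_sensitivity(j):
        positives = y_true[:, j] == 1
        return np.mean(y_pred[positives, j] == 1)

    # Type-specific specificity for type j: correctly labeled negatives over
    # all images without that type of abnormality.
    def type_specificity(j):
        negatives = y_true[:, j] == 0
        return np.mean(y_pred[negatives, j] == 0)

    # One plausible reading of the overall metrics: pool the (image, type)
    # decisions across all abnormality types.
    pos = y_true == 1
    overall_sensitivity = np.sum(y_pred[pos] == 1) / np.sum(pos)
    neg = y_true == 0
    overall_specificity = np.sum(y_pred[neg] == 0) / np.sum(neg)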
The mean accuracy, sensitivity, specificity, and AUC, with 95%
confidence intervals (CIs), were calculated. ROC curves were plotted as
sensitivity (true-positive rate) versus 1 - specificity (false-positive
rate); the ROC curve shows the performance of a classification model
across all classification thresholds.
One-sample t-tests were applied to compare the overall performance of
the AI with that of the 13 doctors, as well as with that of the doctors
at each of the three competency levels (AI vs. all doctors, and AI vs.
expert, competent, or trainee doctors). Paired t-tests were applied to
compare the performance of the doctors without and with AI assistance.
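Both tests map directly onto scipy.stats in Python; in this sketch the accuracy values are invented for illustration, and ttest_1samp treats the AI's single score as the reference value:

    import numpy as np
    from scipy import stats

    # Hypothetical accuracies for the 13 doctors and a single AI accuracy.
    doctor_acc = np.array([0.82, 0.85, 0.79, 0.88, 0.90, 0.84, 0.81,
                           0.86, 0.83, 0.87, 0.80, 0.89, 0.85])
    ai_acc = 0.91

    # One-sample t-test: do the doctors' accuracies differ from the AI's value?
    t1, p1 = stats.ttest_1samp(doctor_acc, popmean=ai_acc)

    # Hypothetical accuracies for the same 13 doctors with AI assistance.
    doctor_acc_assisted = doctor_acc + np.array([0.03, 0.02, 0.05, 0.01, 0.00,
                                                 0.04, 0.05, 0.02, 0.03, 0.01,
                                                 0.06, 0.00, 0.02])

    # Paired t-test: the same doctors measured without vs. with AI assistance.
    t2, p2 = stats.ttest_rel(doctor_acc, doctor_acc_assisted)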
Analysis of variance was applied to compare the average improvement in
performance across the three doctor groups, and Bonferroni correction
was applied for all multiple comparisons.
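For the group comparison, a one-way ANOVA followed by Bonferroni-corrected pairwise tests might look as follows in Python; the per-group improvement values and group sizes are hypothetical:

    from scipy import stats

    # Hypothetical improvements (with-AI minus without-AI accuracy) per group.
    expert = [0.01, 0.02, 0.00, 0.01]
    competent = [0.03, 0.04, 0.02, 0.05]
    trainee = [0.06, 0.07, 0.05, 0.08]

    # One-way analysis of variance across the three doctor groups.
    f_stat, p_anova = stats.f_oneway(expert, competent, trainee)

    # Bonferroni correction for the three pairwise follow-up comparisons:
    # multiply each raw p-value by the number of comparisons (capped at 1).
    pairs = [(expert, competent), (expert, trainee), (competent, trainee)]
    for a, b in pairs:
        t, p = stats.ttest_ind(a, b)
        p_adj = min(p * len(pairs), 1.0)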
All analyses were performed using statistical software (Stata, version
15.0; StataCorp LLC, College Station, TX), and a P value of less than
0.05 was considered statistically significant.