Statistical Analysis
The diagnostic performance of the AI model and the human doctors was assessed with multiple metrics, including accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC). These metrics were defined as follows:
Accuracy = the number of correctly labeled images divided by the total number of test images;
Type-specific sensitivity = the number of images correctly labeled with a given type of abnormality divided by the total number of images with that type of abnormality;
Overall sensitivity = the total number of images correctly labeled with their corresponding types of abnormality divided by the total number of images with any type of abnormality;
Type-specific specificity = the number of images correctly labeled as free of a given type of abnormality divided by the total number of images without that type of abnormality;
Overall specificity = the total number of images correctly labeled as free of the corresponding types of abnormality divided by the total number of images without any type of abnormality.
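For concreteness, the following Python sketch computes these metrics. It is an illustration only, not the authors' code: the multi-label encoding (one 0/1 column per abnormality type) and the reading of the "overall" definitions (an abnormal image counts as correct only if every abnormality type present is predicted) are assumptions.

```python
# Minimal sketch of the metric definitions above; encoding and the
# interpretation of the "overall" metrics are assumptions.
import numpy as np

def accuracy(Y_true, Y_pred):
    """Accuracy: fully correctly labeled images / total test images."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    return (Y_true == Y_pred).all(axis=1).mean()

def type_specific(y_true, y_pred):
    """Sensitivity/specificity for a single abnormality type.

    y_true, y_pred: 1-D 0/1 arrays, one entry per test image.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    sensitivity = (y_true & y_pred).sum() / y_true.sum()
    specificity = (~y_true & ~y_pred).sum() / (~y_true).sum()
    return sensitivity, specificity

def overall(Y_true, Y_pred):
    """Overall sensitivity/specificity pooled over all abnormality types.

    Y_true, Y_pred: (n_images, n_types) 0/1 arrays.
    """
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    abnormal = Y_true.any(axis=1)              # images with any abnormality
    hit = (~Y_true | Y_pred).all(axis=1)       # all true types were predicted
    clean = ~Y_pred.any(axis=1)                # no abnormality predicted
    sensitivity = (abnormal & hit).sum() / abnormal.sum()
    specificity = (~abnormal & clean).sum() / (~abnormal).sum()
    return sensitivity, specificity
```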
The mean accuracy, sensitivity, specificity, and AUC with 95% confidence intervals (CIs) were calculated. ROC curves were plotted as sensitivity (true-positive rate) versus 1 − specificity (false-positive rate); the ROC curve characterizes the performance of a classification model across all classification thresholds. One-sample t-tests were applied to compare the overall performance of the AI to that of the 13 doctors as a group, and to that of the doctors at each of the three degrees of expertise (AI vs. all doctors, and AI vs. expert, competent, or trainee doctors). Paired t-tests were applied to compare the performance of the doctors with and without AI assistance. Analysis of variance was applied to compare the average improvement in performance among doctors of the three degrees, and Bonferroni correction was applied for all multiple comparisons. All analyses were performed using statistical software (Stata, version 15.0; StataCorp LLC, College Station, TX), and a P value of less than 0.05 was considered significant for all analyses.
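The analyses were run in Stata 15.0 as stated above; purely for illustration, the SciPy/scikit-learn sketch below mirrors the same tests. All data, variable names, and group sizes here are hypothetical placeholders, not the study's data.

```python
# Hedged sketch of the reported analyses using SciPy/scikit-learn instead
# of Stata; every array below is synthetic placeholder data.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# --- ROC curve and AUC for the AI model (hypothetical labels/scores) ---
y_true = rng.integers(0, 2, size=200)            # 1 = abnormal, 0 = normal
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(200), 0, 1)
fpr, tpr, _ = roc_curve(y_true, y_score)         # fpr = 1 - specificity
ai_auc = auc(fpr, tpr)

# --- One-sample t-test: 13 doctors' accuracies vs. the AI's accuracy ---
doctor_acc = rng.normal(0.80, 0.05, size=13)     # hypothetical per-doctor accuracy
ai_acc = 0.85                                    # hypothetical AI accuracy (a constant)
t1, p1 = stats.ttest_1samp(doctor_acc, ai_acc)

# --- Paired t-test: the same doctors without vs. with AI assistance ---
acc_without = doctor_acc
acc_with = acc_without + rng.normal(0.03, 0.02, size=13)
t2, p2 = stats.ttest_rel(acc_with, acc_without)

# --- One-way ANOVA: improvement across the three degrees of expertise ---
# (4/5/4 split of the 13 doctors is a hypothetical grouping.)
improvement = acc_with - acc_without
imp_expert, imp_competent, imp_trainee = (
    improvement[:4], improvement[4:9], improvement[9:])
f_stat, p_anova = stats.f_oneway(imp_expert, imp_competent, imp_trainee)

# --- Bonferroni correction: scale each p-value by the number of tests
# in the comparison family (shown here over these three tests). ---
p_values = np.array([p1, p2, p_anova])
p_bonferroni = np.minimum(p_values * len(p_values), 1.0)
```

Each corrected p-value is compared against the 0.05 threshold, matching the significance criterion stated above.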