Interpretation in light of other evidence
There are several possible reasons why the LR model probably performs better than the ML model.
First, ML tends to work better when variables have strong predictive power (20,48). Most of the candidate predictors in this model have low predictive power: for parity, age and previous caesarean section, the difference in area under the curve produced by the permutation-based variable importance was <0.02. Several factors may explain why this specific dataset, and its separate and combined predictors, showed low predictive power. On the one hand, the outcome may be inherently unpredictable, meaning these candidate predictors have little influence on the outcome measure. On the other hand, the dataset may be too small to detect the predictive power of a candidate predictor; a larger dataset could possibly identify more predictors (20,48).
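The permutation-based variable importance used here can be sketched as follows. This is a minimal illustration on synthetic data (not the study's cohort; predictor strengths and model settings are assumptions): the importance of a predictor is the drop in AUC after its column is randomly shuffled, which breaks its association with the outcome.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a cohort of 446 participants (NOT the study data):
# three candidate predictors of varying strength and a binary outcome.
n = 446
X = rng.normal(size=(n, 3))
logit = 0.8 * X[:, 0] + 0.2 * X[:, 1]          # predictor 2 is pure noise
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
auc_full = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

# Importance of predictor j = drop in AUC when column j is randomly permuted.
deltas = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    auc_perm = roc_auc_score(y_test, rf.predict_proba(X_perm)[:, 1])
    deltas.append(auc_full - auc_perm)
    print(f"predictor {j}: delta AUC = {deltas[-1]:.3f}")
```

A delta AUC below 0.02, as observed for most predictors in this study, means permuting the predictor barely degrades discrimination, i.e. the model extracts little signal from it.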
Second, some studies demonstrate that ML performs better with a large number of predictors than with a small set of pre-specified predictors. Performance appears to depend on the number of predictors (p) and the ratio p:n (with n the sample size): RF tends to perform better with increasing p and p:n (20,24,49,50). In our study, to limit potential bias, the same five predictors as published before (1) were considered for both the LR and RF algorithms. This allowed a fair comparison between the two models, but probably to the disadvantage of the RF model (20,24,49,50).
Another possible reason for the lower AUC of the RF model is that large datasets are needed to reach optimal RF performance. A dataset of 446 participants may be too small for robust conclusions, whereas for LR this number of patients can be sufficient to develop a prediction model.
Finally, for this clinical problem a logistic approach may simply model the relationship between surgical re-intervention and the explanatory variables better than a RF model. The complex, nonlinear relationships mentioned earlier, which an ML approach can better capture, are probably not present in this dataset.
Strengths and limitations
The predictors obtained by univariate and multivariate logistic regression are in accordance with the existing literature (51). However, when we compare variable importance between the OR (LR) and the difference in AUC (RF) for each variable, the rankings differ.
The difference in ranking of variable importance is a limitation of the study, because there is no proper way to compare the importance of each predictor of surgical re-intervention between the RF and LR models. In the LR model, the OR for predictor X is defined as the odds of a surgical re-intervention in participants having predictor X over the odds in participants not having predictor X (exponentiated beta coefficient). In the RF model, the variable importance of predictor X is defined as the decrease in AUC when predictor X is permuted.
Dysmenorrhea (OR 2.48) and parity >5 (OR 7.63) have the highest odds ratios in the multivariate analysis, while by difference in area under the curve the duration of menstruation and dysmenorrhea are the most important variables. We consider two possible reasons for this difference. First, for the LR model all continuous variables (except age) were discretized, while for the RF model continuous variables were kept as such. Second, in the LR model the predictors have different units and were not standardized, which means variable importance cannot be assessed by simply comparing the raw sizes of the ORs (2,8,13–18,31,32,44). This can be seen as a strength of our study, since the difference in AUC for each predictor (permuted vs. not permuted) reflects variable importance in a standardized way.
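The unit-dependence of the OR can be made concrete. In the sketch below (synthetic data; the predictor and its units are hypothetical, not variables from the study), the same continuous predictor yields a very different raw OR depending on whether it is expressed in days or in weeks, while an OR per standard deviation gives a unit-free quantity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 446

# Hypothetical continuous predictor, e.g. a duration measured in days.
days = rng.normal(5, 2, n)
logit = 0.8 * (days - 5)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def odds_ratio(x, y):
    """OR per one-unit increase: exp(beta) from a univariable LR fit."""
    lr = LogisticRegression().fit(x.reshape(-1, 1), y)
    return float(np.exp(lr.coef_[0, 0]))

or_days  = odds_ratio(days, y)                              # per extra day
or_weeks = odds_ratio(days / 7, y)                          # same data, per extra week
or_sd    = odds_ratio((days - days.mean()) / days.std(), y) # per standard deviation

print(or_days, or_weeks, or_sd)  # raw ORs change with the unit chosen
```

Because rescaling the predictor rescales beta, the OR per week is far larger than the OR per day for identical data, which is why raw OR magnitudes across predictors with different units do not rank importance.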
We used bootstrap resampling (n=5000) for internal validation of both the LR and RF models. Using the same validation method limits potential bias. Furthermore, the same predictors were considered for the LR and ML algorithms; this also limits potential bias, but limits the potential power of the RF technique as well. Another important strength of this study is the use of all participants in evaluating the performance of the RF model: because every participant appears in the left-out test sets across resamples, there is no need for an independent validation dataset.
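The bootstrap internal-validation scheme can be sketched as follows. This is a simplified illustration on synthetic data with far fewer resamples than the study's 5000, and the exact resampling details of the study may differ: each model is fit on a sample drawn with replacement and evaluated on the participants left out of that sample, so over many resamples all participants contribute to the test sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 446
X = rng.normal(size=(n, 5))               # five candidate predictors (synthetic)
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

aucs = []
for _ in range(50):                       # the study used 5000 resamples
    boot = rng.integers(0, n, n)          # bootstrap sample, drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)  # left-out participants form the test set
    if len(np.unique(y[boot])) < 2 or len(np.unique(y[oob])) < 2:
        continue                          # AUC is undefined without both classes
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[boot], y[boot])
    aucs.append(roc_auc_score(y[oob], rf.predict_proba(X[oob])[:, 1]))

print(f"bootstrap AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Averaging the out-of-sample AUCs over resamples gives an optimism-corrected estimate of performance without reserving a separate hold-out cohort.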
It could be seen as a limitation of this study that we did not perform an external validation in another cohort. However, since the RF model did not outperform the logistic regression model at internal validation, we did not expect external validation to change this conclusion. In addition, an external validation of the logistic regression model is being performed at the time of this study (52).
Finally, in our experience ML models are not easily implemented in clinical practice, since they are often not available in the software packages commonly used there. However, structured data registration is increasing, which will make it easier to create large datasets suitable for ML programs.