Interpretation in light of other evidence
There are several possible explanations for why the LR model probably performs better than the ML model.
First, ML tends to work better for variables with strong predictive power (20,48), whereas most of the candidate predictors in this model, including parity, age and previous caesarean section, have low predictive power: the difference in area under the curve (AUC) produced by permutation-based variable importance was <0.02 for these predictors. There are several possible explanations for the low predictive power of the separate and combined predictors in this dataset. On the one hand, the outcome may be inherently unpredictable, meaning these candidate predictors have little influence on the outcome measure. On the other hand, the dataset may be too small to reveal the predictive power of a candidate predictor; a larger dataset could possibly identify more predictors (20,48).
Second, some studies demonstrate that ML performs worse when only a small set of pre-specified predictors is used in the prediction model: both the number of predictors (p) and the ratio of p to the sample size n appear to matter, with RF tending to perform better as p and p:n increase (20,24,49,50). In our study, to limit potential bias, the same five predictors as published before (1) were considered for the LR and RF algorithms. We did this to allow a fair comparison between the two models, probably to the disadvantage of the RF model (20,24,49,50).
Another possible reason for the lower AUC of the RF model is that RF requires large datasets to reach optimal performance. A dataset of 446 participants may be too small for robust conclusions, whereas for LR this number of patients can be sufficient to develop a prediction model.
Finally, it is possible that for this clinical problem a logistic approach simply models the relationship between surgical re-intervention and the explanatory variables better than a RF model does: the complex, nonlinear relationships mentioned earlier, which a ML approach can better capture, are probably not present in this dataset.
Strengths and limitations
The predictors obtained by univariate and multivariate logistic regression are in accordance with the existing literature (51). However, when we compare variable importance between the odds ratios (LR) and the differences in AUC (RF) for each variable, we find a different ranking of variable importance.
This difference in ranking of variable importance is a limitation of the study, because there is no proper way to compare the importance of each predictor of surgical re-intervention between the RF and LR models. For the LR model, the OR of predictor X is defined as the odds of a surgical re-intervention in participants with predictor X over the odds in participants without predictor X (the exponentiated regression coefficient β), whereas for the RF model the variable importance of predictor X is defined as the difference in AUC between the original and the permuted predictor.
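As an illustration of the permutation-based measure defined above, the following sketch computes, for each predictor, the drop in AUC when that predictor is shuffled. The data here are synthetic and all values are illustrative; this is not the study's dataset or implementation.

```python
# Sketch of permutation-based variable importance as a drop in AUC.
# Synthetic data: only the first predictor carries (weak) signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 446  # cohort size matching the study, data itself is simulated
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
base_auc = roc_auc_score(y, rf.predict_proba(X)[:, 1])

deltas = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break link between predictor j and y
    perm_auc = roc_auc_score(y, rf.predict_proba(Xp)[:, 1])
    deltas.append(base_auc - perm_auc)
    print(f"predictor {j}: delta AUC = {deltas[-1]:.3f}")
```

Because each importance is expressed on the same scale (a difference in AUC), the values are directly comparable across predictors, unlike raw odds ratios.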
Dysmenorrhea (OR 2.48) and parity >5 (OR 7.63) have the highest odds ratios in the multivariate analysis, while by difference in AUC the duration of menstruation and dysmenorrhea are the most important variables. We consider two possible reasons for this difference in ranking. The first is that for the LR model all continuous variables (except age) were discretized, whereas the RF model handled the continuous variables directly. The second is that in the LR model the predictors have different units and were not standardized, which means variable importance cannot be assessed by simply comparing the raw sizes of the ORs (2,8,13–18,31,32,44). This can be seen as a strength of our study, since the difference in AUC for each predictor (permuted vs. not permuted) reflects variable importance in a standardized way.
We used bootstrap resampling (n=5000) for internal validation of both the LR and RF models. Using the same validation method limits potential bias. Furthermore, the same predictors were considered for the LR and ML algorithms, which also limits potential bias but at the same time limits the potential power of the RF technique.
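The bootstrap internal-validation procedure can be sketched as follows: fit the model on each bootstrap resample, compare its performance on the resample with its performance on the original data, and subtract the average optimism from the apparent AUC. The data and model below are synthetic and illustrative; the number of resamples is reduced from the study's 5000 for speed.

```python
# Sketch of optimism-corrected bootstrap internal validation (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 446  # cohort size matching the study, data itself is simulated
X = rng.normal(size=(n, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n) > 0).astype(int)

# Apparent performance: model fitted and evaluated on the full dataset.
apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])

n_boot = 200  # 5000 in the study; reduced here for speed
optimism = 0.0
for _ in range(n_boot):
    idx = rng.integers(0, n, n)                    # resample with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism += (auc_boot - auc_orig) / n_boot     # average over resamples

print(f"apparent AUC {apparent:.3f}, corrected {apparent - optimism:.3f}")
```

The corrected AUC estimates how the model would perform in new patients from the same population, which is exactly what internal validation is meant to approximate.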
Another important strength of this study is that all participants were used in evaluating the performance of the RF model. Because performance was assessed on the test sets, there was no need for a separate independent validation dataset.
It could be seen as a limitation of this study that we did not perform an external validation in another cohort. However, we did not expect the RF model to perform significantly better on external validation, since on internal validation it did not outperform the logistic regression model. In addition, an external validation of the logistic regression model is being performed at the time of this study (52).
Finally, in our experience ML models are not easily implemented in clinical practice, since they are often not available in the software packages commonly used there. However, structured data registration is increasing, which will make it easier to create the large datasets needed for ML programs.