Development of the random forest model
We first trained an RF model using the following five pre-operative
predictors: age, duration of menstruation, dysmenorrhea, parity and
previous caesarean section. These factors were associated with a higher
probability of surgical re-intervention within two years after EA in the
previously published multivariate logistic regression model (1).
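As an illustration of the data structure used in the sketches below, the five predictors and the outcome can be arranged in a MATLAB table; the variable names and the randomly generated dummy values are assumptions for illustration only and do not reproduce the study data.

```matlab
% Hypothetical analysis table: 5 pre-operative predictors and the outcome.
% All values are randomly generated dummy data (NOT the study data).
rng(1);                                            % reproducible dummy data
n = 446;                                           % number of cases in the study
T = table( ...
    randi([25 55], n, 1), ...                      % Age (years)
    randi([2 10],  n, 1), ...                      % DurationMenstruation (days)
    randi([0 1],   n, 1), ...                      % Dysmenorrhea (0 = no, 1 = yes)
    randi([0 4],   n, 1), ...                      % Parity
    randi([0 1],   n, 1), ...                      % PreviousCS (previous caesarean section)
    randi([0 1],   n, 1), ...                      % Reintervention within two years (0/1)
    'VariableNames', {'Age', 'DurationMenstruation', 'Dysmenorrhea', ...
                      'Parity', 'PreviousCS', 'Reintervention'});
```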
As described above, an RF model is an ensemble of many decision trees.
Each tree in the forest is built on a random sample of patients drawn
from the training set (“tree bagging”). Figure 1 shows an example of an
individual decision tree in the random forest. A decision tree is a
flowchart-like binary branching structure. At each ‘node split’ in the
tree the data are divided in two, based on the value of the variable at
that decision node. When no further splits are possible, a prediction is
calculated for the cases in the final leaf node (23,26,36).
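As a minimal sketch of tree bagging, a single tree of the forest can be grown on a bootstrap sample of patients; fitctree is used here only to illustrate one tree (TreeBagger performs this step internally), and the table T is the dummy data defined above.

```matlab
% Grow one decision tree on a bootstrap sample of patients ("tree bagging").
inBag   = randsample(n, n, true);                  % draw n patients with replacement
treeOne = fitctree(T(inBag, :), 'Reintervention'); % one flowchart-like binary tree
view(treeOne, 'Mode', 'graph');                    % inspect node splits and leaf nodes
```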
At each node split only a random subset of features (such as duration of
menstruation and parity) is considered (“feature bagging”). This
prevents a few strongly predictive features from dominating every tree,
which would lead to similar splits across the trees. Feature bagging
thereby yields a more robust model and helps prevent overfitting
(21,23,26,27,36,37).
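Feature bagging can be illustrated as drawing a random subset of candidate predictors for a split; the sketch below shows only the idea, since TreeBagger performs this step internally (via its 'NumPredictorsToSample' argument).

```matlab
% "Feature bagging": only a random subset of predictors is considered per split.
predictors = {'Age', 'DurationMenstruation', 'Dysmenorrhea', 'Parity', 'PreviousCS'};
mtry       = round(sqrt(numel(predictors)));                   % sqrt(5) ~ 2 candidates
candidates = predictors(randsample(numel(predictors), mtry));  % e.g. duration and parity
```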
Following this process, the classification result of an RF model is
obtained by growing a large ensemble of such trees and averaging the
predictions of the individual decision trees on surgical
re-intervention. Figure 2 shows a simplified example of the RF model; in
practice, the decision trees and the resulting prediction model contain
a much larger number of leaf nodes (26,38).
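The averaging step can be sketched as below: the per-tree predicted probabilities of re-intervention are averaged over the whole forest. Variable names are assumed; in practice, predict applied to the TreeBagger model returns these averaged scores directly.

```matlab
% Average the single-tree predictions to obtain the forest's prediction.
rf     = TreeBagger(500, T, 'Reintervention', 'Method', 'classification');
probas = zeros(height(T), rf.NumTrees);
for t = 1:rf.NumTrees
    [~, s]       = predict(rf.Trees{t}, T);        % class scores of one decision tree
    probas(:, t) = s(:, 2);                        % P(surgical re-intervention), tree t
end
probReintervention = mean(probas, 2);              % ensemble average over all trees
```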
An RF model has its own hyperparameters: ntree, mtry, minimum leaf size
and maximum number of node splits. Ntree is the number of trees in the
forest. On the one hand it should be large enough that each feature
(variable) has sufficient opportunities to be selected, but it should
not be so large that computation time increases unnecessarily. A default
value of ntree = 500 was used (39). Mtry is the number of features
randomly selected as candidate features at each split (“feature
bagging”); it was set to the square root of the number of variables
(√n). Minimum leaf size is the minimum number of cases that each leaf
node must contain after a split. Maximum number of node splits is the
maximum number of splits per tree. Neither a minimum leaf size nor a
maximum number of node splits was set (23,40).
We first ran the RF model with default parameter values before
improving its performance by hyperparameter optimization. Default
parameters are pre-set values for the hyperparameters on which the
construction of the decision trees is based, for example 500 for ntree
(26,27).
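A minimal sketch of this initial model, with the hyperparameter values described above and the minimum leaf size and maximum number of splits left at the function defaults (variable names assumed):

```matlab
% Initial RF with default hyperparameter values.
ntree = 500;                                       % number of trees
mtry  = round(sqrt(5));                            % sqrt of the number of predictors = 2
rf0   = TreeBagger(ntree, T, 'Reintervention', ...
            'Method', 'classification', ...
            'NumPredictorsToSample', mtry);        % MinLeafSize / MaxNumSplits not set
```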
To predict the chance of surgical re-intervention within two years after
EA, the model was initially trained and internally validated on the 446
cases. To allow a fair comparison between the RF and LR models, the same
validation technique had to be used. Therefore, bootstrap resampling
with 5000 resamples was used to create training bags and test bags
(out-of-bag (OOB) samples). The cases that were not selected by the
bootstrap resampling form the test bag, which was used as a validation
sample to assess the performance of the trained model on new
observations (Figure 3). The performance measure area under the receiver
operating characteristic curve (AUROC) was calculated on the test sets
(the OOB samples) and averaged over the 5000 bootstrap samples. These
two bags must not be confused with the “tree bags” and the “feature
bags” used to construct the decision trees in the random forest
(21,23,26,36).
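The bootstrap validation can be sketched as follows, using the dummy table T defined earlier; this is an illustrative reconstruction, not the study's actual script.

```matlab
% Bootstrap validation: train on the training bag, compute the AUROC on the
% out-of-bag (OOB) cases, and average over the resamples.
nBoot = 5000;                                      % as in the study (reduce for a quick test)
auc   = zeros(nBoot, 1);
for b = 1:nBoot
    inBag  = randsample(n, n, true);               % training bag (sampled with replacement)
    outBag = setdiff((1:n)', inBag);               % test bag = cases never drawn (OOB sample)
    rfB    = TreeBagger(500, T(inBag, :), 'Reintervention', 'Method', 'classification');
    [~, s] = predict(rfB, T(outBag, rfB.PredictorNames));  % scores for the OOB cases
    [~, ~, ~, auc(b)] = perfcurve(T.Reintervention(outBag), s(:, 2), 1);
end
meanAUROC = mean(auc);                             % AUROC averaged over the bootstrap samples
```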
The RF was trained in MATLAB (R2018b) using the TreeBagger function in
the Statistics and Machine Learning Toolbox. The curvature test was used
for split-predictor selection, to obtain an unbiased choice between the
continuous and categorical variables. The Gini impurity index was used
to evaluate the quality of a split and to estimate variable importance.
A perfect separation results in a Gini score of zero (all observations
in a node belonging to one label, in this case surgical re-intervention
or no surgical re-intervention), whereas the worst-case split, with a
50/50 division between the classes, results in a Gini score of 0.5 (23).
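A sketch of the corresponding TreeBagger options is shown below; the option names are taken from the MATLAB documentation, but whether variable importance was extracted exactly this way (here via the impurity-decrease property DeltaCriterionDecisionSplit) is an assumption.

```matlab
% TreeBagger with curvature-test split-predictor selection and the Gini
% diversity index ('gdi') as split criterion.
rf = TreeBagger(500, T, 'Reintervention', ...
        'Method',             'classification', ...
        'PredictorSelection', 'curvature', ...     % unbiased choice among mixed-type predictors
        'SplitCriterion',     'gdi');              % Gini impurity to evaluate split quality
giniImportance = rf.DeltaCriterionDecisionSplit;   % impurity-decrease (Gini-based) importance
% Gini impurity of a binary node: g = 1 - p1^2 - p0^2
%   perfect separation : g = 1 - 1^2   - 0^2   = 0
%   worst case (50/50) : g = 1 - 0.5^2 - 0.5^2 = 0.5
```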
The parameter optimization was performed by a random grid search over
the minimum leaf size and the maximum number of splits. The minimum leaf
size can take a value between 1 and half the sample size (N/2 = 223).
The maximum number of splits can take a value between 1 and the sample
size minus one (N-1 = 445). A random search was chosen since it has been
shown to achieve performance similar to a full grid search at a reduced
computation time (38,41).
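The search space can be sketched as 20 combinations drawn uniformly at random from the ranges given above (variable names assumed):

```matlab
% Random search space for the two tuned hyperparameters (N = 446 cases).
nCombos      = 20;
minLeafSizes = randi([1, 223], nCombos, 1);        % 1 ... N/2
maxNumSplits = randi([1, 445], nCombos, 1);        % 1 ... N-1
```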
For each random combination of minimum leaf size and maximum number of
splits, an RF was trained on the training bag. Each combination was
scored using the OOB predictions of the tree bags. This was repeated for
20 random combinations; the combination with the highest area under the
curve (AUC) on the OOB predictions was used to train an RF, which was
then tested on the validation test set (42).
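The tuning loop can be sketched as follows; for brevity the forest is trained here on the full dummy table T, whereas in the study this step was carried out within each training bag, and the MaxNumSplits pass-through argument is assumed to be available as documented for TreeBagger.

```matlab
% Score each random hyperparameter combination by the AUC of its OOB predictions.
oobAUC = zeros(nCombos, 1);
for k = 1:nCombos
    rfK = TreeBagger(500, T, 'Reintervention', ...
              'Method',        'classification', ...
              'OOBPrediction', 'on', ...
              'MinLeafSize',   minLeafSizes(k), ...
              'MaxNumSplits',  maxNumSplits(k));
    [~, s] = oobPredict(rfK);                      % OOB class scores of the tree bags
    [~, ~, ~, oobAUC(k)] = perfcurve(T.Reintervention, s(:, 2), 1);
end
[~, best]        = max(oobAUC);                    % combination with the highest OOB AUC
bestMinLeafSize  = minLeafSizes(best);
bestMaxNumSplits = maxNumSplits(best);
```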