Development of the random forest model
We first trained an RF model using the following five pre-operative predictors: age, duration of menstruation, dysmenorrhea, parity and previous caesarean section. These factors were associated with a higher probability of surgical re-intervention within two years after EA in the previously published multivariate logistic regression model (1).
As described above, an RF model is an ensemble of many decision tree models. When building the decision trees, each tree in the forest uses a random sample of cases (patients) from the training set (“tree bagging”). Figure 1 shows an example of an individual decision tree in the random forest. A decision tree is a flowchart-like binary branching structure. At each ‘node split’ in the tree, the data are divided in two based on the value of the variable at the decision node. If no further splits are possible, a prediction is calculated for the cases in the final leaf node (23,26,36).
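To make this structure concrete, the following sketch grows a single decision tree of this kind in MATLAB on simulated data; the predictor names, values and the chosen leaf size are hypothetical and only illustrate the idea of node splits and leaf predictions (fitctree is the single-tree counterpart of the trees grown inside an RF):

% Simulated data purely for illustration; predictor names and values are hypothetical.
rng(1);
n = 446;
X = table(randi([25 55], n, 1), randi([2 10], n, 1), randi([0 1], n, 1), ...
    randi([0 4], n, 1), randi([0 1], n, 1), ...
    'VariableNames', {'age', 'menstruationDuration', 'dysmenorrhea', ...
                      'parity', 'previousCaesarean'});
Y = randi([0 1], n, 1);                          % hypothetical binary outcome

% Grow one decision tree: each node split divides the cases in two based on
% the value of one predictor; each leaf returns a predicted class probability
% for the cases that end up in it.
singleTree = fitctree(X, Y, 'MinLeafSize', 20);  % leaf size chosen only for readability
view(singleTree, 'Mode', 'graph');               % flowchart-like view of the splits
[~, leafProbabilities] = predict(singleTree, X(1:5, :));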
At each node split, a random subset of features (such as duration of menstruation and parity) is considered (“feature bagging”). This avoids over-selection of strongly predictive features, which would lead to similar splits across the trees. The result is a more robust model in which overfitting is prevented (21,23,26,27,36,37).
Following this process, the classification result of an RF model is produced by growing a large ensemble of such trees and averaging the predictions of the individual decision trees for surgical re-intervention. Figure 2 shows a simplified example of the RF model. In practice, the decision trees and the resulting prediction model contain a large number of leaf nodes (26,38).
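The averaging step can be illustrated with MATLAB's TreeBagger: the ensemble score for a case is the (weighted) average of the scores returned by the individual trees. The following is a minimal sketch on simulated data, not the study data:

% Simulated data purely for illustration.
rng(2);
X = rand(446, 5);                       % five hypothetical predictors
Y = double(rand(446, 1) > 0.8);         % hypothetical binary outcome

rf = TreeBagger(500, X, Y, 'Method', 'classification');

% Ensemble prediction for one case: scores averaged over all trees.
[~, ensembleScore] = predict(rf, X(1, :));

% The same value should be recovered by averaging the single-tree predictions.
nTrees = numel(rf.Trees);
treeScores = zeros(nTrees, 2);
for t = 1:nTrees
    [~, treeScores(t, :)] = predict(rf.Trees{t}, X(1, :));
end
fprintf('Ensemble score: %.3f, mean over single trees: %.3f\n', ...
    ensembleScore(2), mean(treeScores(:, 2)));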
An RF model has its own hyperparameters: ntree, mtry, minimum leaf size and maximum number of node splits. Ntree is the number of trees in the forest. On the one hand, it should be as large as possible, so that each feature (variable) has enough opportunities to be picked; on the other hand, it should not be so large that calculation time increases unnecessarily. A default value of ntree = 500 was used (39). Mtry is the number of features randomly selected as candidate features at each split (“feature bagging”). This was set to the square root of the number of variables (√n). Minimum leaf size is the minimum number of cases required to produce another node split. Maximum node splits is the maximum number of splits per tree. Neither a minimum leaf size nor a maximum number of node splits was set (23,40).
We began by running the RF model with default parameter values before improving the RF's performance by hyperparameter optimization. Default parameters are pre-set values for the hyperparameters on which the construction of the decision trees is based, for example 500 for ntree (26,27).
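As an illustration of how these hyperparameters map onto TreeBagger arguments, the following sketch uses simulated data; the minimum leaf size and maximum number of splits shown here are arbitrary example values (neither was fixed in the model described above), and it is assumed that MaxNumSplits is forwarded to the underlying trees:

% Simulated data purely for illustration.
rng(3);
X = rand(446, 5);
Y = double(rand(446, 1) > 0.8);

nTree = 500;                            % ntree: number of trees (default value used here)
mtry  = ceil(sqrt(size(X, 2)));         % mtry: features sampled at each split ("feature bagging")

rf = TreeBagger(nTree, X, Y, ...
    'Method',                'classification', ...
    'NumPredictorsToSample', mtry, ...
    'MinLeafSize',           10, ...    % example value; not fixed in the model described above
    'MaxNumSplits',          50);       % example value; assumed to be passed to the trees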
To predict the chance of surgical re-intervention within two years after EA, the model was initially trained and internally validated on the 446 cases. To allow a fair comparison between the RF and the LR model, the same validation technique had to be used. Therefore, bootstrap resampling with 5000 repetitions was used to create training bags and test bags (out-of-bag (OOB) samples). The cases that were not selected by the bootstrap resampling form the test bag, which was used as a validation sample to assess the performance of the trained model on new observations (Figure 3). The performance measure, the area under the receiver operating characteristic curve (AUROC), was calculated on the test sets (the OOB samples) and averaged over the 5000 bootstrap samples. These two bags must not be confused with the “tree bags” and the “feature bags” used to construct the decision trees in the random forest (21,23,26,36).
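A minimal sketch of this bootstrap validation scheme is given below, using simulated stand-in data for the 446 cases. The loop mirrors the description above (training bag drawn with replacement, non-drawn cases as test bag, AUROC averaged over resamples); with 5000 repetitions it is computationally demanding and is shown here only to illustrate the procedure:

% Simulated stand-in for the 446 cases (for illustration only).
rng(4);
N = 446;
X = rand(N, 5);
Y = double(rand(N, 1) > 0.8);

nBoot = 5000;                                    % number of bootstrap resamples
aucBoot = nan(nBoot, 1);

for b = 1:nBoot
    trainIdx = randsample(N, N, true);           % training bag: drawn with replacement
    testIdx  = setdiff(1:N, trainIdx);           % test bag: cases not drawn (OOB sample)
    if isempty(testIdx) || numel(unique(Y(testIdx))) < 2
        continue;                                % AUROC is undefined without both classes
    end
    % Note: this outer bootstrap is separate from the internal "tree bagging"
    % that TreeBagger performs when growing the individual trees.
    rf = TreeBagger(500, X(trainIdx, :), Y(trainIdx), 'Method', 'classification');
    [~, score] = predict(rf, X(testIdx, :));
    [~, ~, ~, aucBoot(b)] = perfcurve(Y(testIdx), score(:, 2), 1);
end

meanAUROC = mean(aucBoot, 'omitnan');            % AUROC averaged over the bootstrap samples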
The RF was trained in MATLAB (R2018b) using the TreeBagger function in the Statistics and Machine Learning Toolbox. The curvature test was used for split-predictor selection to obtain an unbiased selection between the continuous and categorical variables. The Gini impurity index was used to evaluate the quality of a split and to estimate variable importance. A perfect separation results in a Gini score of zero (all observations belonging to one label, in this case surgical re-intervention or no surgical re-intervention), whereas the worst possible split results in a 50/50 division of the classes (23).
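A sketch of such a TreeBagger call on simulated data is given below; the Gini criterion and the curvature test are requested explicitly, and an impurity-based importance is obtained here by averaging predictorImportance over the trees, which is one possible way to compute Gini-based variable importance and is an assumption rather than the study's exact procedure:

% Simulated stand-in data (for illustration only).
rng(5);
X = rand(446, 5);
Y = double(rand(446, 1) > 0.8);

rf = TreeBagger(500, X, Y, ...
    'Method',             'classification', ...
    'SplitCriterion',     'gdi', ...        % Gini diversity index (default for classification)
    'PredictorSelection', 'curvature', ...  % curvature test for unbiased split-predictor selection
    'OOBPrediction',      'on');

% Impurity-based (Gini) variable importance, averaged over the trees.
imp = zeros(1, size(X, 2));
for t = 1:numel(rf.Trees)
    imp = imp + predictorImportance(rf.Trees{t});
end
imp = imp / numel(rf.Trees);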
The parameter optimization was performed by a random grid search over the minimum leaf size and the maximum number of splits. The minimum leaf size can take a value between 1 and half the sample size (N/2 = 223). The maximum number of splits can take a value between 1 and the sample size minus one (N-1 = 445). A random search was chosen since it has been shown to perform similarly to a full grid search but with reduced computation time (38,41).
For each random combination of minimum leaf size and maximum number of splits, an RF was trained on the training bag. The combination was scored using the OOB predictions of the tree bags. This was repeated for 20 random combinations; the combination with the highest area under the curve (AUC) on the OOB predictions was used to train an RF, which was then tested on the validation test set (42).
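A minimal sketch of this random search on simulated data is given below; 20 candidate combinations are drawn from the ranges given above and scored with the OOB predictions of each candidate forest, and the final refit and evaluation on the validation test set are only indicated:

% Simulated stand-in data (for illustration only).
rng(6);
N = 446;
X = rand(N, 5);
Y = double(rand(N, 1) > 0.8);

nCombos   = 20;
leafSizes = randi([1, N/2], nCombos, 1);         % minimum leaf size in [1, N/2]
maxSplits = randi([1, N - 1], nCombos, 1);       % maximum number of splits in [1, N-1]
oobAUC    = nan(nCombos, 1);

for k = 1:nCombos
    rf = TreeBagger(500, X, Y, ...
        'Method',        'classification', ...
        'OOBPrediction', 'on', ...
        'MinLeafSize',   leafSizes(k), ...
        'MaxNumSplits',  maxSplits(k));
    [~, oobScore] = oobPredict(rf);              % OOB predictions of the tree bags
    [~, ~, ~, oobAUC(k)] = perfcurve(Y, oobScore(:, 2), 1);
end

[~, best] = max(oobAUC);                         % best combination by OOB AUC
bestLeafSize  = leafSizes(best);
bestMaxSplits = maxSplits(best);
% A forest refitted with bestLeafSize and bestMaxSplits would then be
% evaluated on the held-out validation test set.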