Bootstrap Validation.
The bootstrap method is a resampling technique often used to estimate
statistics on a population and to validate a model by sampling a
dataset with replacement. It allows us to use a computer to mimic the
process of obtaining new data sets, so that the variability of the
estimates can be assessed without collecting additional samples.
Instead of repeatedly obtaining independent data sets from the
population, which is often not realistic, bootstrapping obtains
distinct data sets by repeatedly sampling from the original data set
with replacement. The idea behind bootstrapping is that the original
observed data take the place of the population of interest, and each
bootstrap sample represents a sample from that population.
Bootstrap samples are of the same size as the original sample and are
drawn randomly with replacement from it. In sampling with replacement,
after a data point (observation) is selected for the subsample, it
remains available for further selection. As a result, some observations
are represented multiple times in a bootstrap sample while others may
not be selected at all. Because of this overlap with the original data,
on average about two-thirds (63.2%) of the original data points appear
in each bootstrap sample4. The observations that are not included in a
bootstrap sample are called “out-of-bag” samples.
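The sampling scheme described above can be sketched as follows. This is a minimal illustration using a hypothetical dataset of 1,000 index values: one bootstrap sample of size n is drawn with replacement, and the in-bag fraction comes out near the roughly two-thirds (1 − 1/e ≈ 0.632) figure mentioned above.

```python
import random

random.seed(0)

n = 1000
data = list(range(n))  # hypothetical dataset: n observation indices

# Draw one bootstrap sample: n draws with replacement.
boot = [random.choice(data) for _ in range(n)]

# Distinct observations that made it into the bootstrap sample ("in-bag"),
# and those that were never selected ("out-of-bag").
in_bag = set(boot)
out_of_bag = [x for x in data if x not in in_bag]

# On average about 63.2% (1 - 1/e) of observations are in-bag,
# so roughly 36.8% are out-of-bag.
print(f"in-bag fraction:     {len(in_bag) / n:.3f}")
print(f"out-of-bag fraction: {len(out_of_bag) / n:.3f}")
```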
When performing the bootstrap, two things must be specified: the size of
each sample and the number of repetitions of the procedure. A common
practice is to use a sample size equal to that of the original data set
and a large number of repetitions (50-200) to obtain a stable
performance estimate2, 4.
In the bootstrap method, a prediction model is developed in each
bootstrap sample, and measures of predictive ability such as the
C-statistic are estimated in that sample. Each bootstrap model is then
applied to the original dataset, and its predictive measure
(C-statistic) is estimated in the original data. The difference in the
predictive measure between the two datasets indicates optimism, which
is estimated by averaging these differences across all bootstrap
samples. Finally, this estimate of optimism is subtracted from the
performance of the prediction model developed in the original data to
obtain an optimism-adjusted measure of the predictive ability of the
model2.
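The optimism-correction loop described above can be sketched as follows. This is an illustrative sketch, not the cited authors' implementation: it substitutes a toy least-squares model with R² as the performance measure in place of the C-statistic, on synthetic data generated for the example. Each bootstrap model is scored on its own sample and on the original data, and the averaged gap (optimism) is subtracted from the apparent performance.

```python
import random

def fit(xs, ys):
    # Ordinary least-squares fit of y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r_squared(model, xs, ys):
    # Performance measure (stand-in for the C-statistic in this sketch).
    a, b = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(1)
n = 60
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

# Apparent performance: model developed and evaluated on the same data.
apparent = r_squared(fit(xs, ys), xs, ys)

B = 200  # number of bootstrap repetitions
optimism = 0.0
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
    bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
    m = fit(bx, by)  # model developed in the bootstrap sample
    # Performance in the bootstrap sample minus performance in the
    # original data: the per-repetition optimism.
    optimism += r_squared(m, bx, by) - r_squared(m, xs, ys)
optimism /= B

corrected = apparent - optimism
print(f"apparent R^2:  {apparent:.3f}")
print(f"optimism:      {optimism:.3f}")
print(f"corrected R^2: {corrected:.3f}")
```

The corrected value is the apparent performance deflated by the average amount the bootstrap models overstated their own performance.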
Bootstrap samples overlap substantially with the original data (roughly
two-thirds), which causes the method to underestimate the true
prediction error. This is considered a disadvantage of the method.
However, the issue can be addressed by estimating model performance
only on the observations that were not selected by the bootstrap (the
out-of-bag samples). Bootstrapping is more complex to carry out and
interpret because of the resampling machinery and the amount of
computation required. However, with a large number of repetitions this
method provides more stable results (lower variance) than other
methods.
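The out-of-bag fix described above can be sketched with a deliberately simple "model" (predicting the bootstrap-sample mean) on synthetic data invented for this example: each repetition evaluates only on observations absent from its own bootstrap sample, so the training and evaluation data never overlap.

```python
import random

random.seed(2)
n = 50
ys = [random.gauss(10, 2) for _ in range(n)]  # synthetic outcome values

B = 500  # number of bootstrap repetitions
errs = []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
    in_bag = set(idx)
    # Toy "model" developed in the bootstrap sample: predict its mean.
    pred = sum(ys[i] for i in idx) / n
    # Evaluate only on the out-of-bag observations.
    errs.extend((ys[i] - pred) ** 2 for i in range(n) if i not in in_bag)

oob_mse = sum(errs) / len(errs)
print(f"out-of-bag MSE estimate: {oob_mse:.2f}")
```

Because no observation is used both to develop and to evaluate the same bootstrap model, the out-of-bag estimate avoids the optimistic bias caused by the overlap.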
Each of the internal model validation techniques has advantages and
disadvantages, and no one method is uniformly better than another4.
Researchers differ in their opinions on how to choose the appropriate
method for internal model validation. Several factors, such as sample
size, the best indicators of a model’s performance, and the need to
choose between models, should be considered before making the
choice4.
The above-mentioned procedures for model validation pertain to internal
validation, which does not examine the generalizability of the model. To
ensure generalizability, it is necessary to evaluate the model on a
different set of data: new data, not used in the development process,
collected from an appropriate (representative) patient population.