2.4. Preprocessing
Missing values were common in the extracted data. If a patient died or
was discharged on the first (or second) day of admission, no data were
recorded for the remaining days. In that case, we assumed that the data
from the first (or second) day could represent the data for the
remaining days. Patients missing more than 20% of the variables were
removed. Other missing values were imputed with the median of the
patient's outcome group (dead or alive). Descriptive data are expressed as
actual numbers and percentages or mean ± standard deviation. Five-fold
cross-validation was used to evaluate each model. The whole data set was
randomly divided into five subsamples. One subsample was retained as
the test set, and the remaining four subsamples were used as the
training data. This process was repeated five times, so that each
subsample served as the test set exactly once. The average over the
five runs was used to represent the performance of each model.
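The missing-data handling and the five-fold procedure described above
can be sketched roughly as follows. This is a minimal illustration using
only the Python standard library, not the code used in the study. The
record layout (a `died` flag plus a `features` dictionary with `None`
for missing values), all function names, the random seed, and the
reading of the imputation rule as "median within the patient's own
outcome group" are assumptions made for the example.

```python
import random
from statistics import median


def carry_forward(daily_records):
    """Fill days without an entry (None) with the last recorded day."""
    filled, last = [], None
    for record in daily_records:
        if record is not None:
            last = record
        filled.append(last)
    return filled


def drop_sparse_patients(rows, max_missing=0.20):
    """Remove patients missing more than max_missing of their variables."""
    kept = []
    for row in rows:
        features = row["features"]
        missing = sum(value is None for value in features.values())
        if missing / len(features) <= max_missing:
            kept.append(row)
    return kept


def impute_group_median(rows):
    """Replace each None, in place, with the variable's median in the
    patient's outcome group. Assumes every group has at least one
    observed value for every variable."""
    observed = {}
    for row in rows:
        for name, value in row["features"].items():
            if value is not None:
                observed.setdefault((row["died"], name), []).append(value)
    medians = {key: median(values) for key, values in observed.items()}
    for row in rows:
        for name, value in row["features"].items():
            if value is None:
                row["features"][name] = medians[(row["died"], name)]
    return rows


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validate(n, evaluate, k=5, seed=0):
    """Average a user-supplied evaluate(train, test) score over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for any hypothetical function that fits a model
on the training indices and returns a performance score on the test
indices. Libraries such as scikit-learn offer an equivalent splitter
(`KFold`); the text does not state which tooling was used.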
To better represent the relationship between variables, one-hot
encoding was applied to the text-based (categorical) variables. Because
Naïve Bayes does not handle variables measured on different scales
well, we standardized the data before training so that each variable
had a mean of 0 and a standard deviation of 1. Note that standardization
rescales the data but does not change the shape of its distribution.
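As an illustration of these two transformations, the sketch below
one-hot encodes a categorical column and standardizes a numeric one
using only the Python standard library. The function names and example
values are hypothetical, and two details are assumptions of the sketch
rather than statements from the text: the population standard deviation
is used, and the mean and standard deviation are estimated separately so
that they can be computed on training data and reused on test data.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each category to a 0/1 indicator vector (categories sorted)."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


def fit_standardizer(column):
    """Estimate the mean and population standard deviation of a column."""
    mu = mean(column)
    sigma = pstdev(column)
    # Guard against constant columns, which would cause division by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(column, mu, sigma):
    """Rescale values to z-scores: (x - mean) / standard deviation."""
    return [(x - mu) / sigma for x in column]


categories, encoded = one_hot(["male", "female", "male"])
# categories -> ['female', 'male']
# encoded    -> [[0, 1], [1, 0], [0, 1]]

mu, sigma = fit_standardizer([1.0, 2.0, 3.0])
z_scores = standardize([1.0, 2.0, 3.0], mu, sigma)
```

After this step the standardized column has mean 0 and standard
deviation 1, while the one-hot columns remain 0/1 indicators.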