2.4. Preprocessing
Missing values were common in the extracted data. If a patient died or was discharged on the first (or second) day of admission, no data were recorded for the remaining days. In that case, we assumed that the data from the first (or second) day could represent the remaining days and carried them forward. Patients missing more than 20% of the variables were excluded. Other missing values were imputed with the median of the corresponding group (deceased or surviving). Descriptive data are expressed as numbers and percentages or as mean ± standard deviation.
Five-fold cross-validation was used to evaluate the models. The full data set was randomly divided into five subsamples. In each round, one subsample was retained as the test set and the remaining four were used as the training data. The process was repeated five times so that each subsample served as the test set once, and the average over the five rounds was used to represent the performance of each model.
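The missing-value handling and the five-fold split described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names, the use of `None` for a missing entry, the `died` outcome key, and the fixed random seed are all assumptions made for illustration, and the group-wise median reflects one reading of the imputation rule.

```python
import random
from statistics import mean, median


def carry_forward(daily_values):
    """Fill days with no entry (None) using the last recorded day's value."""
    filled, last = [], None
    for value in daily_values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def drop_sparse(records, features, max_missing=0.20):
    """Keep patients missing at most max_missing of the variables."""
    return [
        r for r in records
        if sum(r[f] is None for f in features) / len(features) <= max_missing
    ]


def impute_group_median(records, features, label="died"):
    """Replace None in place with the median of the same outcome group."""
    for f in features:
        for outcome in {r[label] for r in records}:
            group = [r for r in records if r[label] == outcome]
            observed = [r[f] for r in group if r[f] is not None]
            if not observed:
                continue  # no observed value in this group; leave the gap
            fill = median(observed)
            for r in group:
                if r[f] is None:
                    r[f] = fill
    return records


def k_fold_splits(n, k=5, seed=0):
    """Yield (train_idx, test_idx); every sample is in a test fold once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n, score_fold, k=5, seed=0):
    """Average a per-fold score (a caller-supplied function) over k folds."""
    return mean(score_fold(tr, te) for tr, te in k_fold_splits(n, k, seed))
```

Here `score_fold` stands for any hypothetical function that trains a model on the training indices and returns its score on the test indices; `cross_validate` only averages those five scores.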
To better capture the relationships between variables, one-hot encoding was used for text-based variables. Because Naïve Bayes handles variables measured on different scales poorly, we standardized the data before training so that each variable had a mean of 0 and a standard deviation of 1.
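As an illustration of these two transformations, the sketch below one-hot encodes a text-based column and standardizes a numeric column to zero mean and unit standard deviation. The function names and the example values are hypothetical, and the use of the population standard deviation is an assumption; the original implementation may differ.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Turn a categorical column into 0/1 indicator rows, one per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def fit_standardizer(column):
    """Estimate the mean and (population) standard deviation of a column."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma


def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in column]


# Hypothetical example values, not taken from the study data.
categories, encoded = one_hot(["M", "F", "M"])
mu, sigma = fit_standardizer([2.0, 4.0, 6.0])
scaled = standardize([2.0, 4.0, 6.0], mu, sigma)
```

Under cross-validation, one reasonable design is to estimate `mu` and `sigma` on the training folds only and reuse them for the test fold, so that no information from the test set enters the preprocessing. Standardization rescales each variable but does not change the shape of its distribution.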