2.5. Modeling
In this study, five prediction models, namely RF, GBDT, decision tree, KNN, and Naïve Bayes, were established and compared.
Naïve Bayes is a classifier built on Bayes' theorem from probability theory and mathematical statistics. It is a supervised algorithm that directly models the probability relationship between labels and features [15]. The simplicity of Naïve Bayes makes the model run quickly. However, its validity rests on the assumption that the features of the samples are conditionally independent [16], which is rarely satisfied in practical applications.
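As an illustration, the following minimal sketch fits a Gaussian Naïve Bayes classifier with scikit-learn; the example data set and the train/test split parameters are illustrative assumptions, not taken from this study.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative data set; the study's own feature matrix would be used instead.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The Gaussian variant assumes each feature is normally distributed and
# conditionally independent given the class label.
model = GaussianNB()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))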
K-Nearest Neighbor (KNN) is a simple non-parametric classification method. For a record d to be classified, its k nearest neighbors are retrieved to form the neighborhood of d. Whether or not distance-based weighting is applied, the category of d is usually determined by the majority of the records in this neighborhood [17].
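A minimal sketch of this procedure with scikit-learn follows; the choice of k = 5 and the use of distance weighting are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# weights="distance" makes closer neighbors count more in the vote;
# weights="uniform" gives an unweighted majority vote instead.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))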
Gradient Boosting Decision Tree (GBDT) was first proposed by Jerome H. Friedman [18]. The trees in GBDT are regression trees, so the method can be used for both regression prediction and classification. The core of GBDT is that each new tree fits the residual (the negative gradient of the loss) left by the sum of all previous trees' predictions, so that adding the new tree's prediction brings the ensemble closer to the true value.
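This residual-fitting idea can be sketched directly for squared-error regression, where the negative gradient is simply the true value minus the current prediction; the synthetic data, learning rate, and tree depth below are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Start from the mean; each tree then fits the residual of the current ensemble.
prediction = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(100):
    residual = y - prediction          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))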
The decision tree algorithm is essentially a tree-shaped graph structure and, like KNN, a non-parametric supervised learning algorithm. It summarizes decision rules from a series of data with features and labels and presents them in a tree structure. A decision tree contains a root node, intermediate nodes, and leaf nodes. The decision tree follows the principle of top-down segmentation [19]: starting from the root node, each split is chosen according to the principle of minimum impurity, and growth stops when the number of records in a node falls below a preset threshold. The commonly used impurity measures are the Gini index and information entropy [20]:
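For a node t containing samples from C classes with class proportions $p_i$, these measures take their conventional forms (stated here for completeness):

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{C} p_i^{2}, \qquad \mathrm{Entropy}(t) = -\sum_{i=1}^{C} p_i \log_2 p_i.$$

A minimal sketch of such a tree in scikit-learn follows; the impurity criterion and the stopping threshold (min_samples_split) are illustrative assumptions, not the settings used in this study.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" selects Gini impurity; "entropy" uses information entropy.
# min_samples_split stops growth when a node holds fewer records than the threshold.
tree = DecisionTreeClassifier(criterion="gini", min_samples_split=10, random_state=0)
tree.fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())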