2.5. Modeling
In this study, five prediction models, namely RF, GBDT, decision tree, KNN, and Naïve Bayes, were established and compared.
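As a minimal sketch of how such a comparison might be set up, the following assumes scikit-learn estimators and a synthetic dataset standing in for the study's features and labels; all names and hyperparameters are illustrative placeholders, not the study's actual configuration.

```python
# Hypothetical five-model comparison via 5-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data (the study's real feature matrix is not shown here)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # mean accuracy per fold
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```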
Naïve Bayes is a supervised classification algorithm based on Bayesian theory from probability theory and mathematical statistics; it directly measures the probability relationship between labels and features [15]. Its simplicity makes the model run quickly. However, it is valid only when the features of the samples are independent of one another [16], a condition that is difficult to satisfy in practical applications.
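A minimal sketch of this idea, assuming scikit-learn's Gaussian variant and synthetic placeholder data:

```python
# Gaussian Naive Bayes: fast to fit, models P(label | features) directly
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
nb = GaussianNB().fit(X, y)
# predict_proba exposes the class-probability estimates for each sample
print(nb.predict_proba(X[:3]))
```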
K-Nearest Neighbor (KNN) is a simple non-parametric classification method. Given a data record d to be classified, its k nearest neighbors are retrieved to form the neighborhood of d. Whether or not distance-based weighting is applied, the category of d is usually determined by the majority of the data records in the neighborhood [17].
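A minimal sketch, assuming scikit-learn, that contrasts the unweighted majority vote with the distance-weighted variant mentioned above (data and parameters are placeholders):

```python
# KNN: class of a query point decided by a vote among its k nearest neighbors
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
# "uniform": each neighbor votes equally; "distance": closer neighbors count more
unweighted = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
print(unweighted.predict(X[:3]), weighted.predict(X[:3]))
```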
Gradient Boosting Decision Tree (GBDT) was first proposed by Jerome H. Friedman [18]. The trees in GBDT are regression trees, so the method can be used for both regression prediction and classification. The core of GBDT is that each new tree learns the residual (negative gradient) of the sum of the conclusions of all previous trees; the residual is the difference between the true value and the predicted value, so adding it to the predicted value recovers the true value.
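A minimal sketch of the residual-fitting idea, assuming scikit-learn (version 1.0 or later for the "squared_error" loss name) and synthetic regression data; with squared-error loss the negative gradient is exactly the plain residual:

```python
# GBDT for regression: each added tree fits the current residual,
# so the ensemble's error shrinks as trees accumulate
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
gbdt = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                 loss="squared_error", random_state=0)
gbdt.fit(X, y)
# staged_predict yields the ensemble prediction after each added tree
for i, pred in enumerate(gbdt.staged_predict(X)):
    if i % 10 == 0:
        print(i, np.abs(y - pred).mean())  # mean absolute residual
```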
The decision tree algorithm is essentially a graph structure and, like KNN, is a non-parametric supervised learning algorithm. It can summarize decision rules from a series of data with features and labels and present them in a tree structure. A decision tree contains a root node, intermediate nodes, and leaf nodes. The decision tree follows the principle of top-down segmentation [19]: starting from the root node, splits are made according to the principle of minimum impurity, and growth stops when the number of records in a node falls below a preset threshold. The commonly used impurity measures are the Gini index and information entropy [20]:
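For a node $t$ whose samples belong to $K$ classes with proportions $p(i\mid t)$, these measures take their standard forms:

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{K} p(i\mid t)^{2}, \qquad \mathrm{Entropy}(t) = -\sum_{i=1}^{K} p(i\mid t)\,\log_{2} p(i\mid t).$$

A minimal sketch, assuming scikit-learn, where the criterion parameter selects the impurity measure and min_samples_split plays the role of the preset record-count threshold that stops growth (data and threshold are placeholders):

```python
# Decision trees grown under the two impurity criteria
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gini_tree = DecisionTreeClassifier(criterion="gini",
                                   min_samples_split=10).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy",
                                      min_samples_split=10).fit(X, y)
print(gini_tree.get_depth(), entropy_tree.get_depth())
```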