2.3.7 Bagging Model
Bagging [38], short for bootstrap aggregating, builds m new data sets by sampling the original data set with replacement: to obtain a new data set of size n, each sample is drawn at random from the original data set and put back before the next draw. A base learner is trained on each of these bootstrap samples, and the base learners are then combined. When combining the predicted outputs, Bagging usually adopts simple voting for classification tasks and simple averaging for regression tasks. Bagging focuses on reducing variance. The algorithm is as follows (a minimal sketch is given after the list):
A) Training sets are extracted from the original sample set. In each round, n training samples are drawn from the original sample set by bootstrapping (within a training set, some samples may be drawn multiple times while others are never drawn). A total of m rounds are performed, yielding m training sets that are independent of one another.
B) One model is trained on each training set, so the m training sets yield m models. (Note: no specific classification or regression algorithm is prescribed here; different methods, such as decision trees or perceptrons, can be adopted according to the specific problem.)
C) For classification, the m models vote to determine the classification result; for regression, the mean of the m models' outputs is taken as the final result (all models are equally important).
A simplified diagram is shown in Figure 8.
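The sketch below illustrates the procedure above, assuming scikit-learn decision trees as the base learners; the function names `bagging_fit` and `bagging_predict` are illustrative, not part of any library.

```python
# Minimal Bagging sketch: m bootstrap samples -> m base learners -> majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m, seed=0):
    """Train m base learners, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.choice(n, size=n, replace=True)  # sampling with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine predictions by simple (majority) voting, assuming integer class labels."""
    votes = np.stack([model.predict(X) for model in models])  # shape (m, n_samples)
    # Majority vote per column; for a regression task, use np.mean(votes, axis=0) instead.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Because each learner sees a slightly different resample of the data, averaging or voting over them reduces the variance of the combined predictor, as stated above.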
2.3.8 KNN Model
KNN (k-Nearest Neighbor) [39] works as follows: there is a sample data set, also known as the training sample set, in which every data point carries a label, i.e., the relationship between each data point and its class is known. When an unlabeled data point is input, each of its features is compared with the corresponding features of the data points in the sample set, and the class label of the most similar (nearest neighbor) data points is extracted. In general, only the k most similar data points in the sample set are considered, which is where the k in the k-nearest neighbor algorithm comes from; k is typically an integer smaller than 20. Finally, the class that occurs most often among these k most similar data points is taken as the class of the new data point. KNN has no explicit training process, which makes it a representative of "lazy learning": it merely stores the data in the training stage (so the training time is zero) and does all processing only after receiving a test sample.
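A minimal sketch of this procedure, assuming Euclidean distance as the similarity measure and integer class labels; `knn_classify` is an illustrative name, not a library function.

```python
# Minimal KNN sketch: find the k nearest training samples, return the majority label.
import numpy as np

def knn_classify(X_train, y_train, x_new, k=5):
    """Label x_new by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training sample
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    labels = y_train[nearest]
    return np.bincount(labels).argmax()              # most frequent label wins
```

Note that all the distance computation happens at prediction time, which is exactly the "lazy learning" behavior described above.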