3.3.3 Random Forest Model
Random forests achieve high accuracy on many tasks. Our data set consists of two interrelated parts, the drug activity values and 204 features, which makes the analysis high-dimensional. A random forest can process samples with high-dimensional features and can assess the importance of each feature for the prediction task. Obtaining highly correlated features also increases the credibility of the study.

The Pearson correlation coefficient ranking shows that the correlation coefficient between some features is greater than 0.4 (Figure 13). Principal component analysis (PCA) was therefore adopted in this study. PCA has no effect when the original variables are already orthogonal to each other, that is, when there is no correlation between them; because some of our features are correlated, PCA is applicable here. It converts strongly correlated variables into as few new variables as possible to replace the original ones. These new, uncorrelated variables carry the information contained in the original variables, which serves the purpose of high-dimensional data processing. The results of two-dimensional PCA and three-dimensional PCA (3D PCA) show that dimensionality reduction makes it easier to find representative features (Figures 14-15).

The 204 features were scaled, and features with variance below 0.05 were eliminated to retain the most representative features. The Lasso regression model was then used to further select nine features with low mutual correlation and good orthogonality. In the end, we obtained a model with a training-set mean squared error of 0.005 and an R-squared of 0.77 (Figure 16). We believe that the predictions are credible.
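The two screening steps described above, the variance filter on scaled features and the Pearson correlation check, can be illustrated with a minimal sketch. It is not the code used in this study. It assumes min-max scaling (the text does not specify the scaling method), assumes that low-variance features are the ones discarded, and uses a small hypothetical data set with made-up feature names `f1` to `f4`. The helper functions are also hypothetical and rely only on the Python standard library.

```python
from math import sqrt
from statistics import mean, pvariance


def min_max_scale(column):
    """Scale a list of values to the range [0, 1] (assumed scaling method)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_by_variance(features, threshold=0.05):
    """Keep scaled features whose population variance exceeds the threshold."""
    kept = {}
    for name, column in features.items():
        scaled = min_max_scale(column)
        if pvariance(scaled) > threshold:
            kept[name] = scaled
    return kept


def correlated_pairs(features, cutoff=0.4):
    """List feature pairs whose absolute Pearson correlation exceeds the cutoff."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) > cutoff:
                pairs.append((a, b, r))
    return pairs


# Hypothetical toy data: four features measured on five samples.
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.8],  # almost a linear function of f1
    "f3": [5.0, 5.0, 5.0, 5.0, 5.0],  # constant, so zero variance
    "f4": [3.0, 1.0, 4.0, 1.0, 5.0],
}

kept = filter_by_variance(toy, threshold=0.05)
pairs = correlated_pairs(kept, cutoff=0.4)

print("features kept:", list(kept))
for a, b, r in pairs:
    print(f"{a} and {b} are correlated (r = {r:.3f})")
```

In this toy example the constant feature is dropped by the variance filter, and the one strongly correlated pair is the kind of redundancy that PCA or Lasso-based selection would subsequently reduce. The thresholds 0.05 and 0.4 are taken from the text; everything else is illustrative.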