Figure S1. Spectral and spatial information for Sentinel-2 MSI
and Landsat-8 OLI (adapted from the Sentinel-2 MSI Level 2A Products
Algorithm Theoretical Basis Document)
3. Methods
3.1 Random Forest
As a relatively recent machine learning model, random forest can handle
up to several thousand predictor variables and is regarded as one of the
best-performing machine learning algorithms. The random forest
classifier builds on the classification and regression tree (CART)
introduced by Breiman et al. and combines many decision trees through
ensemble learning. Its basic unit is the decision tree: if a single CART
decision tree is viewed as one expert on the classification task, a
random forest is a panel of experts that classify the task jointly
(Iverson et al. 2008; Breiman 2001).
The steps for building a random forest are as follows:
(1) From the original sample set, N training sets are drawn by random
sampling with replacement (the bootstrap method). Each bootstrap
training set contains approximately two-thirds of the distinct
observations in the original data set.
(2) One CART decision tree is grown on each of the N training sets.
During tree growth, m candidate features are randomly selected at each
node (out of M total features, m ≤ M), and the feature that yields the
smallest Gini index, i.e., the strongest class separation, is chosen to
split the node.
(3) The N decision trees together form the random forest classifier,
which classifies the remote sensing imagery and assigns each pixel the
category chosen by a majority vote over the trees.
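The three steps above can be sketched with scikit-learn's
RandomForestClassifier (an assumption for illustration only: the study
itself used EnMAP-Box, and the data below is synthetic, not the paper's
imagery):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))           # 300 samples, M = 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # two synthetic classes

# bootstrap=True resamples with replacement (step 1); max_features="sqrt"
# draws m = sqrt(M) candidate features at each node split and the Gini
# criterion picks the best one (step 2); prediction is by majority vote
# over the n_estimators trees (step 3).
forest = RandomForestClassifier(
    n_estimators=20, criterion="gini",
    max_features="sqrt", bootstrap=True,
    oob_score=True, random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)  # internal accuracy estimate from OOB samples
```

The `oob_score=True` flag anticipates the next subsection: the samples
each tree did not draw provide a built-in accuracy estimate without a
separate validation set.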
The random forest algorithm not only classifies remote sensing images
but also supports feature selection and dimensionality reduction.
Because approximately one-third of the original observations are not
drawn into a given bootstrap sample, these left-out observations are
called the out-of-bag (OOB) data. The OOB error computed from them
provides an internal estimate of classification accuracy and is also
used to calculate the variable importance (VI) of each feature for
feature selection (Genuer et al. 2010; Sandri et al. 2012). The
variable importance is assessed as follows:
\(\text{VI}\left(M_{A}\right)=\frac{1}{N}\sum_{t=1}^{N}{(B_{n_{t}}^{M_{A}}-B_{O_{t}}^{M_{A}})}\),
where VI(\(M_{A}\)) denotes the importance of feature \(M_{A}\), M is
the total number of features, N is the number of decision trees,
\(B_{O_{t}}^{M_{A}}\) is the OOB error of the t-th decision tree when
feature \(M_{A}\) is left unperturbed, and \(B_{n_{t}}^{M_{A}}\) is the
OOB error of the t-th decision tree after random noise is added to
feature \(M_{A}\). If adding noise to a feature \(M_{A}\) sharply
reduces accuracy on the OOB data, the feature strongly influences the
classification result, and its importance is correspondingly high.
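The per-tree OOB bookkeeping in the equation above is handled internally
by most random forest implementations. As a hedged stand-in, the same
permute-and-compare idea can be sketched with scikit-learn's
`permutation_importance`, which evaluates the accuracy drop on a held-out
set rather than on each tree's OOB samples (synthetic data; this is an
illustration of the principle, not the paper's EnMAP-Box computation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=50,
                                random_state=0).fit(X_tr, y_tr)

# Permute one feature at a time and record the drop in accuracy:
# the same noise-injection idea as VI(M_A), averaged over repeats.
result = permutation_importance(forest, X_te, y_te,
                                n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```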
In the current study, the EnMAP-Box toolbox developed by the German
Environmental Mapping and Analysis Program project team was used for
band optimization and for extracting the native and invasive species.
Two parameters matter when constructing the random forest: the number N
of decision trees and the number m of features drawn at each node split.
Following the EnMAP-Box default, we set m to the square root of the
total number of features. In theory, a larger N yields higher
classification accuracy but also a higher computational cost. With m
fixed as above, we found that the OOB error gradually converged and
stabilized once the number of decision trees reached N ≥ 20. We
therefore chose N = 20.
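The N-selection procedure can be sketched as follows: grow forests of
increasing size, track the OOB error, and look for where it levels off
(synthetic data and hypothetical tree counts; the paper's N = 20 was
determined with EnMAP-Box on the actual imagery):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))           # M = 16 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

oob_errors = {}
for n_trees in (10, 20, 40, 80, 160):
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_features="sqrt",             # m = sqrt(M) = 4
        oob_score=True, random_state=0,
    ).fit(X, y)
    oob_errors[n_trees] = 1.0 - forest.oob_score_  # OOB error rate
print(oob_errors)  # error should flatten as n_trees grows
```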
3.2 Accuracy assessment of species classification
The confusion matrix, also called the error matrix, compares the
classification result against the reference measurements to quantify
the degree of confusion between classes. In the current study, four
commonly used indicators were selected to evaluate the classification
results of the different remote sensing images: overall accuracy (OA),
the Kappa coefficient, producer's accuracy (PA), and user's accuracy
(UA).
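A worked example of the four indicators on a toy three-class confusion
matrix (the labels below are illustrative, not the paper's field data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = reference, cols = mapped
oa = np.trace(cm) / cm.sum()            # overall accuracy
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement
pa = np.diag(cm) / cm.sum(axis=1)       # producer's accuracy (per-class recall)
ua = np.diag(cm) / cm.sum(axis=0)       # user's accuracy (per-class precision)
print(oa, kappa, pa, ua)
```

PA reads down the reference totals (omission errors) while UA reads
across the mapped totals (commission errors), which is why the two can
differ for the same class.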
3.3 Landscape index
Landscape indices are highly condensed quantitative descriptors of
landscape pattern: they summarize its structural composition and
spatial configuration and are well suited to quantitative spatial
analysis of the relationship between landscape pattern and ecological
process. Because many landscape indices exist, we consulted previous
research and selected class area (CA), patch density (PD), largest
patch index (LPI), splitting index (SPLIT), and aggregation index (AI)
to characterize the spatial and temporal changes in the habitat pattern
of the native and invasive species (Zhen et al. 2012; Liu et al. 2017).
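As a hedged illustration of two of the listed indices, CA and PD can be
computed from a labeled class map by counting cells and connected
patches (toy 4 x 5 map and an assumed 30 m cell size; the study would
compute these with dedicated landscape-metrics software):

```python
import numpy as np
from scipy import ndimage

# 1 = cells mapped to one species class, 0 = everything else
class_map = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
], dtype=int)
cell_area_ha = 0.09  # a 30 m x 30 m pixel, in hectares

# Default structuring element gives 4-connected patches
labeled, n_patches = ndimage.label(class_map)
ca = class_map.sum() * cell_area_ha              # class area (ha)
total_area_ha = class_map.size * cell_area_ha
patch_density = n_patches / (total_area_ha / 100)  # patches per 100 ha
print(n_patches, ca, patch_density)
```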
Table S1. Description of landscape index