Fig. 2 Structure of channel conversion module.
The channel conversion module has aggregation and excitation stages. In the aggregation stage, global feature information along the spatial dimensions is aggregated into a channel descriptor of dimension C \times 1 \times 1 by average pooling. The c-th element of the vector A is calculated as

A_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j),  (3)

where H and W are the height and width, respectively, of the feature map, and x_c(i, j) is the pixel at position (i, j) in the c-th channel. Aggregation thus computes the average feature of each channel in the feature map.
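The aggregation step of Eq. (3) can be sketched in NumPy as follows; the (C, H, W) layout and the example values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def channel_descriptor(x):
    """Aggregate a (C, H, W) feature map into a length-C vector by
    spatial average pooling, as in Eq. (3):
    A_c = (1 / (H * W)) * sum_{i=1..H} sum_{j=1..W} x_c(i, j)."""
    c, h, w = x.shape
    return x.reshape(c, h * w).mean(axis=1)

# Example: a 3-channel 4x4 feature map.
x = np.arange(48, dtype=float).reshape(3, 4, 4)
a = channel_descriptor(x)
print(a.shape)  # (3,)
```

Each entry of `a` is simply the mean activation of one channel over all spatial positions.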
To fully utilize the aggregated feature information, the excitation operation captures the dependencies between channels. The aggregated information is passed through two fully connected layers to learn the inter-channel dependencies. A sigmoid function then yields a weight vector S of dimension C, which characterizes the importance of each channel. The weight vector is multiplied by the original feature map to obtain the reassigned feature map, which enhances important information and weakens minor information. The dependency between the channels is

S = F_{ex}(A) = \sigma(W_2 \, \delta(W_1 A)),  (4)

where F_{ex}(\cdot) is the calculation of channel weights, W_1 and W_2 are the weight matrices of the two fully connected layers, \sigma is the sigmoid function, and \delta is the ReLU function. The channel information of the feature map is recalibrated as

\tilde{X} = S \otimes X,  (5)

where \tilde{X} is the feature map after reassigning channel weights, and \otimes denotes channel-wise multiplication between the weight vector S and the feature map X.
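Eqs. (4) and (5) can be sketched in NumPy as below. The reduction ratio r, the weight-matrix shapes, and the random weights are illustrative assumptions standing in for the learned fully connected layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def excite(a, w1, w2):
    """Eq. (4): S = sigmoid(W2 @ relu(W1 @ a)).
    w1 and w2 are the weight matrices of the two fully connected
    layers; shapes (C//r, C) and (C, C//r) are assumed here."""
    return sigmoid(w2 @ np.maximum(w1 @ a, 0.0))

def recalibrate(x, s):
    """Eq. (5): channel-wise multiplication of the weight vector s
    with the (C, H, W) feature map x."""
    return x * s[:, None, None]

rng = np.random.default_rng(0)
c, r = 8, 2                                  # channels, reduction ratio (assumed)
x = rng.standard_normal((c, 4, 4))
a = x.reshape(c, -1).mean(axis=1)            # Eq. (3) channel descriptor
w1 = rng.standard_normal((c // r, c))        # stand-in for learned weights
w2 = rng.standard_normal((c, c // r))
s = excite(a, w1, w2)                        # per-channel importance in (0, 1)
y = recalibrate(x, s)
print(y.shape)  # (8, 4, 4)
```

The sigmoid bounds each channel weight in (0, 1), so important channels are preserved while minor channels are scaled down.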
C. Global Regression Module
The ResNet50 module produces feature maps with different resolutions.
High-resolution, low-level feature maps contain less semantic
information but rich spatial detail information, while low-resolution,
high-level feature maps have rich semantic information and less spatial
detail information. To fully exploit the feature information of
different dimensions, the low- and high-resolution feature maps are
combined by vertical and horizontal paths. The vertical path obtains a high-resolution feature map by upsampling the spatially low-resolution feature map. A 1 × 1 convolution then reduces the number of channels of the low-level feature map, so as to obtain a feature map with the same dimension as the corresponding vertical-path feature map. The horizontal path fuses the two feature maps (Fig. 3). This pyramidal
structure allows feature maps of different resolutions to contain more
semantic information, enabling the network to learn richer feature
information.
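One fusion step of this pyramidal structure can be sketched in NumPy as follows. Nearest-neighbour upsampling, element-wise addition as the fusion operation, and the channel counts are illustrative assumptions; the paper does not specify these details:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    """1x1 convolution as a matrix product over the channel axis;
    w has shape (C_out, C_in)."""
    c_in, h, wd = x.shape
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, wd)

rng = np.random.default_rng(0)
low_level = rng.standard_normal((256, 8, 8))   # high-res, low-level map
high_level = rng.standard_normal((64, 4, 4))   # low-res, high-level map
w = rng.standard_normal((64, 256)) * 0.01      # stand-in 1x1 conv weights

# Vertical path: upsample the low-resolution, high-level feature map.
top_down = upsample2x(high_level)              # (64, 8, 8)
# 1x1 convolution matches the low-level map's channel count.
lateral = conv1x1(low_level, w)                # (64, 8, 8)
# Horizontal path: fuse the two maps (element-wise addition assumed).
fused = lateral + top_down
print(fused.shape)  # (64, 8, 8)
```

After fusion, the high-resolution map carries the semantic information propagated down from the low-resolution map.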