Fig. 4 Structure of the local optimization module.
The input is divided into two branches by the “channel split” operation
(Fig. 4). One branch is an identity path that leaves its feature maps
unchanged; the other applies 1 × 1, 3 × 3, and 1 × 1 convolutions in
sequence to extract deep semantic information. The channel conversion
module explicitly models the dependencies between channels, which
enhances important information. A residual connection mitigates network
degradation and improves representational capability. The outputs of the
two paths are concatenated so that the channel dimension remains
unchanged. The “channel shuffle” operation then permutes the channels to
improve the efficiency of information transmission and promote
information fusion. Finally, bilinear interpolation upsampling is used to
obtain a high-resolution feature map.
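The split, concatenate, and shuffle steps can be illustrated with a small index-level sketch. It tracks channel labels only, not tensors, and it assumes an equal split and a shuffle over two groups; neither choice is stated in the text. The function names `channel_split` and `channel_shuffle` are hypothetical helpers written for this illustration.

```python
def channel_split(channels):
    """Split a list of channels into two halves (assumed equal split)."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave channels across groups.

    Equivalent to viewing the channels as a (groups, n) grid,
    transposing it to (n, groups), and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Eight labels stand in for eight feature-map channels.
channels = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]

identity_half, conv_half = channel_split(channels)

# Placeholder for the convolution path: the labels are only marked with a
# prime to show which channels were processed.
conv_half = [name + "'" for name in conv_half]

# Concatenation restores the original channel count.
merged = identity_half + conv_half

# Shuffling interleaves unprocessed and processed channels.
shuffled = channel_shuffle(merged, groups=2)
print(shuffled)
```

After the shuffle, processed and unprocessed channels alternate, so a later split no longer separates them cleanly; this is the sense in which the shuffle promotes information exchange between the two paths.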
Four feature maps of different resolutions are taken from the global
regression module and brought to the same dimensions by the local
optimization module,
, (12)
, (13)
where is the local optimization module and denotes upsampling. Let .
Then , , , and denote the feature maps at the 1/4, 1/8, 1/16, and 1/32
scales, respectively, of the original image. Equation (13) gives the
number of times each of the four feature maps is processed by the
local optimization module, i.e., , , , and . After this processing, the
four feature maps have the same dimensions, and the “concat” operation
is performed as
. (14)
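The arithmetic behind matching the four resolutions can be sketched as follows. This is an illustration under two assumptions that the text does not confirm: each pass through the local optimization module doubles the spatial resolution, and the common target is the 1/4 scale. The pass counts it prints are therefore illustrative and may differ from those in (13). The helper `upsampling_passes` is hypothetical.

```python
def upsampling_passes(size, target_size, factor=2):
    """Count how many x`factor` upsampling passes take `size` to `target_size`."""
    passes = 0
    while size < target_size:
        size *= factor
        passes += 1
    if size != target_size:
        raise ValueError("target size is not reachable with this factor")
    return passes


input_size = 256                      # network input is 256 x 256
scale_denominators = (4, 8, 16, 32)   # 1/4, 1/8, 1/16, 1/32 of the input

# Spatial side length of each feature map.
sizes = [input_size // d for d in scale_denominators]

# Assumed target: the largest of the four maps (the 1/4 scale).
target = max(sizes)
passes = [upsampling_passes(s, target) for s in sizes]

for d, s, p in zip(scale_denominators, sizes, passes):
    print(f"1/{d} scale: {s} x {s}, passes to reach {target} x {target}: {p}")
```

Once all four maps share the same spatial size, they can be concatenated along the channel axis as in (14).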
The 2.5D Gaussian heatmap of the hand joints, obtained by a 1 × 1
convolution, is
. (15)
Experimental Results and Analysis: The RHD and InterHand2.6M datasets
were used to evaluate the performance of the proposed method. The
PyTorch framework was used for training. Each hand image was resized to
256 × 256 and input to the network. The batch size was set to 16, and
the network was trained for 20 epochs on an NVIDIA 3090 GPU. The initial
learning rate was set to 0.0001 and reduced by a factor of 10 at the
15th and 17th epochs.
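The step schedule described above can be written out as a short sketch. It assumes zero-based epoch indices and that each reduction takes effect from epoch index 15 and 17 onward; the text does not specify the indexing convention. The function `learning_rate` is a hypothetical helper, not the authors' training code.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(15, 17), gamma=0.1):
    """Return the learning rate for a given epoch under a step schedule.

    The rate is multiplied by `gamma` once for every milestone that the
    epoch index has reached (assumed zero-based indexing).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

In PyTorch, the same behavior is typically obtained with `torch.optim.lr_scheduler.MultiStepLR`, using `milestones=[15, 17]` and `gamma=0.1`; whether the authors used that scheduler is not stated.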