Fig. 4 Structure of the local optimization module.
The “channel split” operation divides the input feature maps into two branches (Fig. 4). One branch is passed through unchanged; the other is processed by 1 × 1, 3 × 3, and 1 × 1 convolutions to extract deep semantic information. The channel conversion module explicitly models the dependencies between channels and thereby enhances important information. A residual connection mitigates network degradation and improves representational capability. The outputs of the two branches are concatenated so that the channel dimension remains unchanged. The “channel shuffle” operation then reorders the channels to improve the efficiency of information transmission and promote information fusion. Finally, bilinear-interpolation upsampling is used to obtain a high-resolution feature map.
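The channel bookkeeping of the split, concatenation, and shuffle steps can be illustrated with a small sketch that tracks channel labels only. It rests on several assumptions not stated in the text: an even 50/50 split, two shuffle groups, and a ShuffleNet-style shuffle (view as groups × n, transpose, flatten). The helper names `channel_split` and `channel_shuffle` are hypothetical, and the convolutional branch is replaced by a placeholder that only relabels channels.

```python
def channel_split(channels):
    """Split a list of channels into two halves (assumed even 50/50 split)."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Assumed ShuffleNet-style shuffle: view as (groups, n), transpose, flatten."""
    n = len(channels) // groups
    assert n * groups == len(channels), "channel count must divide evenly"
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Label eight channels by index to track where each one ends up.
channels = list(range(8))
identity, to_process = channel_split(channels)

# Placeholder for the convolutional branch: the real branch changes the values
# but keeps the number of channels, which is all this sketch tracks.
processed = [f"p{c}" for c in to_process]

merged = identity + processed        # concatenation restores 8 channels
shuffled = channel_shuffle(merged)   # interleaves the two branches
print(shuffled)
```

Under these assumptions, the shuffle interleaves untouched and processed channels, so a later split no longer separates the two branches cleanly.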
Four feature maps of different resolutions are taken from the global regression module. Feature maps of the same dimensions are obtained by the local optimization module,
, (12)
, (13)
where is the local optimization module and denotes upsampling. Let . Then , , , and denote the feature maps at the 1/4, 1/8, 1/16, and 1/32 scales, respectively, of the original image. The result in (13) represents the processing times of the above four feature maps by the local optimization module, i.e., , , , and . At this time, the four feature maps have the same dimension, and the “concat” operation is performed as
. (14)
The 2.5D Gaussian heatmap of the joints of the hand obtained by 1 × 1 convolution is
. (15)
Experimental Results and Analysis
The RHD and InterHand2.6M datasets were used to evaluate the performance of the proposed method. The PyTorch framework was used for training. Each hand image was resized to 256 × 256 before being input to the network. The batch size was set to 16, and the network was trained for 20 epochs on an NVIDIA RTX 3090 GPU. The initial learning rate was 0.0001 and was reduced by a factor of 10 at the 15th and 17th epochs.
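The step learning-rate schedule described above can be written out as a short, framework-free sketch. It assumes that epochs are counted from 1 and that each reduction takes effect from the start of the named epoch; the text does not specify either detail. The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(15, 17), gamma=0.1):
    """Return the learning rate for a 1-indexed epoch (assumed convention).

    The rate is multiplied by `gamma` once for every milestone that the
    current epoch has reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 20 training epochs.
schedule = {epoch: learning_rate(epoch) for epoch in range(1, 21)}
for epoch in (1, 14, 15, 16, 17, 20):
    print(epoch, f"{schedule[epoch]:.0e}")
```

Under these assumptions, epochs 1 to 14 use 0.0001, epochs 15 and 16 use 0.00001, and epochs 17 to 20 use 0.000001.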