Fig. 3 Structure of global regression module.
Feature maps are obtained by the channel conversion module. Some of these maps have high spatial resolution but little semantic information, while others carry more semantic information at low spatial resolution. Besides providing rich hand feature information, fusing the feature maps recovers detailed information, such as that of fingertips and masked edges. To fuse the feature information, feature maps of different dimensions are first subjected to dimensionality reduction, so that their channels are unified under the same dimension,
, (6)
, (7)
where Vk is the feature map obtained by dimensionality reduction, Uk is the feature map obtained by upsampling, R1 is the convolution operation with a 1 × 1 convolution kernel, is the ReLU function, and B is the upsampling operation of bilinear interpolation, which calculates each point in the new image from its four adjacent points as
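$f(x, y_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21})$, (8)
$f(x, y_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22})$, (9)
$f(x, y) = \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2)$. (10)
Equations (8) and (9) are linear interpolation operations in the x-direction, and equation (10) is a linear interpolation operation in the y-direction. $Q_{11}$, $Q_{21}$, $Q_{12}$, and $Q_{22}$ are points in the original image with coordinates $(x_1, y_1)$, $(x_2, y_1)$, $(x_1, y_2)$, and $(x_2, y_2)$, respectively. The resulting feature maps are added to fuse feature information of different spatial resolutions. The calculation method is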
. (11)
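The fusion steps described above (1 × 1 convolution with ReLU, bilinear upsampling, and addition) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the nested-list layout `[channel][row][column]`, the corner-aligned sampling grid, and the choice to upsample the lower-resolution map to the size of the higher-resolution one are all assumptions made for illustration.

```python
def bilinear_point(img, x, y):
    """Bilinearly interpolate a 2-D grid (img[row][col]) at real-valued (x, y)."""
    h, w = len(img), len(img[0])
    # Clamp the query point to the valid coordinate range.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x1, y1 = int(x), int(y)
    x2, y2 = min(x1 + 1, w - 1), min(y1 + 1, h - 1)
    tx, ty = x - x1, y - y1  # grid spacing is 1, so these are the weights
    # Two linear interpolations along x (rows y1 and y2).
    top = (1 - tx) * img[y1][x1] + tx * img[y1][x2]
    bottom = (1 - tx) * img[y2][x1] + tx * img[y2][x2]
    # One linear interpolation along y.
    return (1 - ty) * top + ty * bottom


def upsample_bilinear(img, out_h, out_w):
    """Resize one channel to (out_h, out_w); corner alignment is an assumption."""
    h, w = len(img), len(img[0])
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    return [[bilinear_point(img, j * sx, i * sy) for j in range(out_w)]
            for i in range(out_h)]


def conv1x1_relu(fmap, weights):
    """1 x 1 convolution followed by ReLU.

    fmap: [C_in][H][W]; weights: [C_out][C_in] (hypothetical layout, no bias).
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[max(0.0, sum(wk[c] * fmap[c][i][j] for c in range(len(fmap))))
              for j in range(w)]
             for i in range(h)]
            for wk in weights]


def fuse(high_res, low_res, w_high, w_low):
    """Unify channels, upsample the low-resolution map, and add the two maps."""
    v_high = conv1x1_relu(high_res, w_high)
    v_low = conv1x1_relu(low_res, w_low)
    h, w = len(v_high[0]), len(v_high[0][0])
    u_low = [upsample_bilinear(channel, h, w) for channel in v_low]
    return [[[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(v_high, u_low)]
```

For example, `fuse(high, low, w_high, w_low)` with a one-channel 3 × 3 `high` and a two-channel 2 × 2 `low` returns a one-channel 3 × 3 map, provided both weight matrices have the same number of output rows.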

D. Local Optimization Module

To reduce errors generated by the global regression module, a local optimization module addresses the inaccurate prediction of joint positions under occlusion. This module extracts deeper information from the feature maps obtained by the global regression module.