A. Multiscale Feature Fusion Network
Gesture images usually contain complex, fine-grained features, and the
fingers and joints are strongly correlated. Relying on a single feature
for hand pose estimation therefore tends to ignore diverse feature
information, which makes it difficult to extract gesture information
accurately. Fig. 1 shows the proposed MS-FF, which estimates the hand
pose from a single RGB image. Feature maps of
different resolutions are extracted from RGB images through the ResNet50
module. These feature maps are fed into the channel conversion module,
which explicitly learns the dependencies between channels so as to
emphasize important information and suppress minor information. Because the level
of feature information depends on the resolution of a feature map, the
global regression module obtains high-resolution feature maps containing
more semantic information, and these are separately input to the local
optimization module to extract deeper information. The Gaussian heatmap
of the hand joints () is obtained to improve the spatial generalization
ability of the model and thus to obtain more accurate joint locations. We
take the lowest-resolution feature map from the channel conversion
module, from which the handedness () and the relative depth
information between the wrist joints () are obtained. The above results
are combined to estimate the hand pose,
, (1)
, (2)
where equation (2) represents the result of gesture estimation, and and
are the camera inverse projection and inverse affine transformation,
respectively.
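
To make the final combination step more concrete, the following is a minimal sketch of how a joint heatmap, an inverse affine transformation, and a camera inverse projection might fit together. It is not the formulation of equations (1) and (2). It assumes a standard pinhole camera model, a hard argmax over the heatmap, and a 2x3 affine matrix that maps original-image pixels to the cropped input. The function names, the intrinsics (fx, fy, cx, cy), and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_heatmap(center_xy, size, sigma=2.0):
    """Return a (size, size) map with a Gaussian peak at center_xy = (x, y)."""
    xs = np.arange(size, dtype=np.float64)   # column (x) coordinates
    ys = xs[:, None]                         # row (y) coordinates
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_heatmap(heatmap):
    """Return the (x, y) pixel location of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)

def invert_affine(points_xy, affine_2x3):
    """Undo p' = A p + t, mapping crop coordinates back to the original image."""
    A = affine_2x3[:, :2]
    t = affine_2x3[:, 2]
    return (points_xy - t) @ np.linalg.inv(A).T

def back_project(points_xy, depth, fx, fy, cx, cy):
    """Pinhole inverse projection: pixels plus depth to camera-space XYZ."""
    x = (points_xy[:, 0] - cx) * depth / fx
    y = (points_xy[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)

# Hypothetical example: one joint, a crop that halves and shifts the image.
heatmap = gaussian_heatmap((20.0, 12.0), size=64)
joint_crop = np.array([decode_heatmap(heatmap)])
affine = np.array([[0.5, 0.0, 10.0],
                   [0.0, 0.5, 5.0]])         # assumed original-to-crop transform
joint_orig = invert_affine(joint_crop, affine)
joint_xyz = back_project(joint_orig, np.array([0.5]),
                         fx=500.0, fy=500.0, cx=112.0, cy=112.0)
```

A hard argmax is used here only for brevity. A differentiable decoder such as a soft-argmax would be a natural substitute when the joint locations must be trained end to end.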