1. Object Detection Network
Object detection, particularly with deep neural networks (DNNs) instead
of hand-crafted features, has attracted immense attention in the past
decade, and tremendous progress has been made. Generally, two types of
DNN-based object detection frameworks dominate: (1) one-stage networks,
e.g., all variants of YOLO(16) and SSD(17); and (2) two-stage networks,
e.g., all variants of RCNN such as Faster RCNN(18). The major difference
between one-stage and two-stage networks is the use of a Region Proposal
Network (RPN)(18). In one-stage networks, localization and
classification are conducted simultaneously within the same network,
whereas in two-stage networks an RPN is employed specifically for object
localization in parallel with the feature extraction. Therefore,
two-stage networks are normally more precise at object localization,
especially for small objects, but consume more time and computing
resources in both the training and inference stages. In the proposed
framework, the object detection network is used to locate the fetal head
related region in the candidate sources, i.e., ultrasound images, and to
predict the corresponding label for each candidate based on the detected
image content. The located fetal head region is then cropped from the
source image and fed into the subsequent fine-grained classification
network to obtain the final prediction for the source image. In this
work, YOLO-v3 with Darknet-53(14) as the feature extraction backbone is
employed to locate the fetal head region in the proposed framework.
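The two-stage pipeline described above (detect, crop, then re-classify) can be sketched as follows. This is an illustrative sketch only: `detector` and `classifier` are hypothetical stand-ins for the trained YOLO-v3 and classification networks, and the box format `(x, y, w, h)` is an assumption.

```python
import numpy as np

def crop_region(image, box):
    # `box` = (x, y, w, h): top-left corner plus size, in pixels
    # (an assumed convention for this sketch).
    x, y, w, h = box
    return image[y:y + h, x:x + w]

def detect_then_classify(image, detector, classifier):
    # Stage 1: the detector proposes fetal-head boxes (with coarse labels).
    # Stage 2: each cropped region is re-examined by the dedicated
    # classifier, whose output gives the final prediction.
    return [classifier(crop_region(image, box)) for box, _ in detector(image)]
```

With stub callables in place of the real networks, the flow can be exercised on a toy array, which makes the hand-off between the two stages explicit.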
YOLO has proved its efficiency in numerous prior works. To elevate the
performance of vanilla YOLO, a more powerful backbone and a more
advanced design are included in the architecture of the latest version,
named YOLO-v3. The whole network of YOLO-v3 can be considered the
composition of two subnetworks: Darknet-53 for feature extraction and a
Feature Pyramid Network (FPN) for multi-scale object detection. In
Darknet-53, the idea of ResNet(19) is comprehensively adopted. Multiple
residual modules (RMs), 23 in total, are employed and divided into five
groups. Each RM is composed of two convolutional layers with different
filter sizes, i.e., 1 × 1 and 3 × 3. After each RM group except the
last, a convolutional layer with a stride of 2 is connected for
down-sampling. Therefore, the dimensionality of the output of the
down-sampling layer, i.e., height × width, is reduced to 1/4 of that of
the feature map produced by the previous RM group. Note also that each
convolutional layer is followed by a batch normalization layer and a
non-linear activation layer equipped with leaky ReLU(20). The batch
normalization (BN)(21) layer is designed to diminish the internal
covariate shift(21), which can easily lead to vanishing gradients in the
training of deep neural networks. Besides, BN introduces regularization,
which enhances the generalization of the whole network. Therefore, the
use of BN makes the training of neural networks, particularly extremely
deep ones, feasible. Similarly, leaky ReLU also reduces the risk of
vanishing gradients in the back-propagation of deep neural networks and
thus accelerates the training process. More importantly, non-linearity
allows networks to model more complicated problems, such as most
real-life problems.
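The two per-layer operations just discussed, batch normalization and leaky ReLU, can be sketched in NumPy as below. This is a minimal illustration of the forward computations, not the framework's actual implementation; the slope `alpha = 0.1` is an assumed value.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch axis to zero mean and unit
    # variance, then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def leaky_relu(x, alpha=0.1):
    # Unlike plain ReLU, negative inputs keep a small slope (alpha),
    # so the gradient never becomes exactly zero for negative activations.
    return np.where(x > 0, x, alpha * x)
```

The `eps` term guards against division by zero when a feature has near-zero variance within a batch.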
In YOLO-v3, an FPN is also applied to improve on the performance of
previous YOLO versions. By investigating the multi-scale feature
representations, i.e., the outputs of the layers of the feature pyramid,
the network can find accurate locations of objects at various scales and
thus determine what they are. Three different scales are considered in
the FPN of YOLO-v3, and three candidate boxes representing possible
locations of the detected object are predicted at each scale. For each
box, a set of parameters, including the coordinates of the centroid and
the offsets of the width and the height, is generated by the network to
indicate a certain region in the original image. Besides, a confidence
score and class predictions are also produced for each predicted
bounding box.
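As a concrete illustration of this box parameterization, YOLO-v3's standard decoding maps the raw outputs (t_x, t_y, t_w, t_h) to an image-space box using the grid cell offset (c_x, c_y) and an anchor prior (p_w, p_h). The sketch below assumes that standard decoding; the particular stride and anchor values are hypothetical.

```python
import math

def decode_box(t, cell, prior, stride):
    # t = (tx, ty, tw, th): raw network outputs for one candidate box.
    # cell = (cx, cy): grid-cell index; prior = (pw, ph): anchor size.
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (sigmoid(tx) + cx) * stride   # centroid x in image pixels
    by = (sigmoid(ty) + cy) * stride   # centroid y in image pixels
    bw = pw * math.exp(tw)             # width, scaled from the anchor prior
    bh = ph * math.exp(th)             # height, scaled from the anchor prior
    return bx, by, bw, bh
```

The sigmoid keeps the predicted centroid inside its grid cell, while the exponential keeps the width and height offsets positive.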
2. Object Classification Network and Model Stacking
With the invention of more advanced computing hardware and the explosive
growth of data, more complicated neural networks can be built and
trained. However, it has been observed that a deeper network does not
always achieve better performance, but may instead suffer a higher error
rate and a degradation of accuracy. To resolve this problem, the idea of
residual learning was introduced by ResNet(19), which is considered one
of the most significant advances in deep learning. By utilizing identity
mapping(22), also known as the residual skip connection, information can
easily pass through even between deep layers. Meanwhile, the residual
connection effectively avoids vanishing gradients even when the weights
are extremely small. Therefore, deeper neural networks with RMs are much
easier to converge and achieve better results, as is the case for
YOLO-v3. Based on this RM technique, ResNet, InceptionResNet(23), and
ResNeXt(15) have been proposed.
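A minimal numerical illustration of why the shortcut helps: with a toy linear "layer" of tiny weight w, the gradient through a plain three-layer stack decays like w³, while through the same layers with identity shortcuts it stays near (1 + w)³ ≈ 1. These are toy scalar functions for intuition only, not an actual network.

```python
def layer(x, w=1e-4):
    # one toy linear "layer" with a deliberately tiny weight
    return w * x

def plain_stack(x):
    # three layers without shortcuts: the gradient shrinks like w**3
    return layer(layer(layer(x)))

def residual_stack(x):
    # the same layers with identity shortcuts, y = x + F(x) each time,
    # so the gradient stays near (1 + w)**3, i.e., close to 1
    for _ in range(3):
        x = x + layer(x)
    return x

def num_grad(fn, x, h=1e-6):
    # central finite-difference estimate of d fn / dx
    return (fn(x + h) - fn(x - h)) / (2 * h)
```

Evaluating both gradients at the same input makes the contrast stark: the plain stack's gradient is of order 10⁻¹², while the residual stack's is of order 1.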
ResNet, one of the most popular CNN-based networks, is designed to
overcome the difficulty of training deep neural networks. Before ResNet
was proposed, a deeper neural network was hard to train and degradation
of performance frequently occurred. The problem comes from vanishing
gradients during training. By introducing the residual shortcut, which
is the essential idea of the RM, the gradient can easily flow through
the network during back-propagation. Consequently, the efficiency of
training is dramatically improved and the performance is boosted.
Inception is another milestone. Not only is the depth of the network
increased in Inception, but the width is extended as well. It tries to
extract various types of features from the same source (image or feature
maps) so as to enhance the capability of the network. By injecting the
idea of ResNet, InceptionResNet-V2 was developed and more successful
results were achieved. In ResNeXt, multiple RMs are successively
stacked, following the idea of VGG(24), in order to increase the depth
of the whole network. Besides, not only does the depth dramatically
influence the performance of a DNN, but the width and the cardinality
are essential as well. Considering all these factors, the idea behind
Inception is also adopted by ResNeXt. For each RM, the single pathway is
replaced by a multi-branch topology, also known as the
split-transform-merge structure. The hyperparameters of each branch
within a multi-branch RM are set to be identical. This modified RM is
more powerful than the vanilla single-branch RM. Compared with
complicated dense structures, the modified multi-branch RM achieves very
close performance while having much lower computational complexity.
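The split-transform-merge structure can be sketched as follows. For simplicity this sketch uses dense projections in place of the convolutions of a real ResNeXt block, and the widths (input 8, bottleneck 2, cardinality 4) are placeholder values, not those of the actual network.

```python
import numpy as np

def resnext_block(x, branches):
    # Split-transform-merge: every branch shares the same topology
    # (project the input down to a narrow width, apply a non-linearity,
    # project back up); the branch outputs are summed (merge) and added
    # to the identity shortcut.
    def transform(v, w_down, w_up):
        return np.maximum(v @ w_down, 0.0) @ w_up
    return x + sum(transform(x, wd, wu) for wd, wu in branches)

# Cardinality 4: four identically shaped branches, input width 8,
# bottleneck width 2 (weights here are random placeholders).
rng = np.random.default_rng(0)
branches = [(rng.normal(size=(8, 2)), rng.normal(size=(2, 8)))
            for _ in range(4)]
```

Because all branches are identically shaped, increasing the cardinality adds capacity without the irregular, hard-to-optimize structure of hand-designed dense modules.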
After the inference of YOLO-v3, all possible fetal head regions within
the candidate images are assumed to be located. Instead of relying on
the classification of YOLO-v3, all detected regions are sent to a set of
powerful networks specifically designed for object classification. The
final prediction for each suspected image region, and hence for the
corresponding original image, is made based on the stacked outputs of
the classification networks.
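One simple way to combine the stacked outputs is soft voting, i.e., averaging the class probabilities of the individual classifiers before taking the argmax. The sketch below assumes this combination rule (the text does not specify the exact stacking scheme, so this is one plausible variant); the optional per-model weights are likewise an assumption.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over one model's class logits
    e = np.exp(z - z.max())
    return e / e.sum()

def stacked_prediction(logits_per_model, weights=None):
    # Soft voting: average each model's class probabilities (optionally
    # weighted), then take the argmax as the final label.
    probs = np.array([softmax(z) for z in logits_per_model])
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    fused = weights @ probs
    return int(np.argmax(fused)), fused
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which is usually why soft voting outperforms majority voting on small ensembles.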