1. Object Detection Network
Object detection, particularly with deep neural networks (DNNs) rather than hand-crafted features, has attracted immense attention in the past decade, and tremendous progress has been made. Generally, two types of DNN-based object detection frameworks dominate: (1) one-stage networks, e.g., all variants of YOLO (16) and SSD (17); and (2) two-stage networks, e.g., all variants of RCNN such as Faster RCNN (18). The major difference between one-stage and two-stage networks is the utilization of a Region Proposal Network (RPN) (18). In one-stage networks, localization and classification are conducted simultaneously within the same network, whereas in two-stage networks an RPN is employed specifically for object localization in parallel with feature extraction. Therefore, two-stage networks are normally more precise in object localization, especially for small objects, but consume more time and computing resources in both the training and inference stages. In the proposed framework, the object detection network is utilized to locate the fetal-head-related region in the candidate sources, i.e., ultrasound images, and to predict the corresponding label for each candidate based on the detected image content. The located fetal head region is then cropped from the source image and fed into the subsequent fine-grained classification network to obtain the final prediction for the source image. In this work, YOLO-v3 with Darknet-53 (14) as the feature extraction backbone is employed to locate the fetal head region in the proposed framework.
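The detect-then-crop step of the pipeline can be sketched as follows. This is an illustrative snippet, not the authors' implementation; the function name `crop_region` and the box convention (centroid plus width/height in pixels, as in YOLO-style outputs) are assumptions.

```python
import numpy as np

def crop_region(image, box):
    """Crop a detected region from an image.

    `box` is (x_center, y_center, width, height) in pixels, the
    convention used by YOLO-style detectors (names are illustrative).
    """
    x, y, w, h = box
    x0 = max(int(x - w / 2), 0)
    y0 = max(int(y - h / 2), 0)
    x1 = min(int(x + w / 2), image.shape[1])
    y1 = min(int(y + h / 2), image.shape[0])
    return image[y0:y1, x0:x1]

# Example: a 100x100 grayscale image with a detected 40x20 region,
# standing in for a fetal head region found by the detector.
image = np.zeros((100, 100))
patch = crop_region(image, (50, 50, 40, 20))
```

The returned patch would then be passed to the fine-grained classification network described below.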
YOLO has proved its efficiency in numerous prior works. To elevate the performance of the vanilla YOLO, a more powerful backbone and more advanced design choices are included in the architecture of its latest version, YOLO-v3. The whole network of YOLO-v3 can be considered the composition of two subnetworks: Darknet-53 for feature extraction and a Feature Pyramid Network (FPN) for multi-scale object detection. In Darknet-53, the idea of ResNet (19) is comprehensively injected: 23 residual modules (RMs) in total are employed, divided into five groups. Each RM is composed of two convolutional layers with different filter sizes, i.e., 1 × 1 and 3 × 3. After each RM group except the last, a convolutional layer with stride 2 is connected for down-sampling; it halves both the height and the width of the feature map produced by the previous RM group, reducing the spatial dimensionality to 1/4. Note also that each convolutional layer is followed by a batch normalization layer and a non-linear activation layer equipped with leaky ReLU (20). Batch normalization (BN) (21) is designed to diminish the internal covariate shift (21), which can easily lead to gradient vanishing when training deep neural networks. Besides, BN introduces regularization into the network, which enhances the generalization of the whole network. The utilization of BN therefore makes the training of neural networks, particularly extremely deep ones, feasible. Similarly, leaky ReLU also reduces the risk of gradient vanishing during back-propagation of deep neural networks and thus accelerates the training process. More importantly, its non-linearity allows networks to model more complicated problems, such as most real-life problems.
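The two per-layer operations mentioned above can be written out explicitly. The sketch below is a minimal NumPy rendition of batch normalization (over the batch axis, at training time) and leaky ReLU; the negative slope 0.1 follows the Darknet convention, and the parameter defaults here are illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch axis, then scale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def leaky_relu(x, alpha=0.1):
    # Keep a small non-zero slope for negative inputs, so the gradient
    # never vanishes entirely on that side.
    return np.where(x > 0, x, alpha * x)

x = np.array([[1.0, -2.0],
              [3.0,  4.0]])
y = leaky_relu(batch_norm(x))
```

After normalization, each feature has approximately zero mean and unit variance within the batch, which is what stabilizes training of very deep stacks.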
In YOLO-v3, an FPN is also applied to improve upon previous YOLO versions. By investigating multi-scale feature representations, i.e., the outputs of the layers of the feature pyramid, the network can find accurate locations of objects at various scales and thus determine what they are. Three different scales are considered in the FPN of YOLO-v3, and at each scale three candidate boxes representing possible locations of the detected object are predicted. For each box, a set of parameters, including the coordinates of the centroid and the offsets of the width and the height, is generated by the network so as to indicate a certain region in the original image. Besides, a confidence score and class predictions are also produced for each predicted bounding box.
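The box parameterization can be made concrete with the standard YOLO-v3 decoding rule: the centroid is the sigmoid of the raw offsets added to the grid-cell index (scaled by the stride of that pyramid level), and the width/height are the anchor priors scaled exponentially. The sketch below assumes this standard rule; variable names are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell, anchor, stride):
    """Decode one YOLO-v3 box prediction.

    t      : raw network outputs (t_x, t_y, t_w, t_h)
    cell   : (c_x, c_y) grid-cell index at this scale
    anchor : (p_w, p_h) anchor-box prior in pixels
    stride : down-sampling factor of this FPN scale
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = (sigmoid(tx) + cx) * stride   # centroid x in image pixels
    by = (sigmoid(ty) + cy) * stride   # centroid y in image pixels
    bw = pw * np.exp(tw)               # width scaled from the anchor prior
    bh = ph * np.exp(th)               # height scaled from the anchor prior
    return bx, by, bw, bh

# Zero raw offsets place the box at the cell centre with the anchor's size.
box = decode_box((0.0, 0.0, 0.0, 0.0), (3, 4), (30.0, 60.0), 32)
```

The confidence score and class probabilities are obtained by applying the same sigmoid to their respective raw outputs.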
2. Object Classification Network and Model Stacking
With the advent of more advanced computing hardware and the explosive growth of data, more complicated neural networks can be built and trained. However, it has been noticed that a deeper network does not always achieve better performance; instead, it may suffer a higher error rate and a degradation of accuracy. To resolve this problem, the idea of residual learning was introduced by ResNet (19), which is considered one of the most significant advances in deep learning. By utilizing identity mapping (22), also known as the residual skip connection, information can easily pass through even the deeper layers. Meanwhile, the residual connection effectively avoids gradient vanishing even when weights are extremely small. Therefore, deeper neural networks built with RMs are much easier to converge and achieve better results, as is the case for YOLO-v3. Based on this RM technique, ResNet, InceptionResNet (23), and ResNeXt (15) have been proposed.
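The identity-mapping idea reduces to y = F(x) + x: the block learns only a residual F on top of its input. A toy sketch of why small weights no longer block information flow, with a stand-in transform instead of real convolutions:

```python
import numpy as np

def residual_block(x, transform):
    # y = F(x) + x : the identity path lets the input (and, in training,
    # the gradient) bypass the learned transformation entirely.
    return transform(x) + x

# Even if the learned transform outputs near-zero values (e.g. because
# its weights are extremely small), the input still passes through
# unchanged via the skip connection.
x = np.array([1.0, 2.0, 3.0])
y = residual_block(x, lambda v: np.zeros_like(v))
```

In a real RM the transform would be the 1 × 1 and 3 × 3 convolution pair described above, each followed by BN and leaky ReLU.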
ResNet, one of the most popular CNN-based networks, is designed to overcome the difficulty of training deep neural networks. Before ResNet was proposed, a deeper neural network was hard to train and degradation of performance often occurred; the problem stems from gradient vanishing during training. By introducing the residual shortcut, which is the essential idea of the RM, the gradient can easily flow through the network during back-propagation. Consequently, the efficiency of training is dramatically improved and the performance is boosted. Inception is another milestone: not only is the depth of the network increased, but the width is extended as well. It tries to extract various types of features from the same source (image or feature maps) so as to enhance the capability of the network. By injecting the idea of ResNet, InceptionResNet-V2 was developed and more successful results were achieved. In ResNeXt, multiple RMs are successively stacked, following the idea of VGG (24), in order to increase the depth of the whole network. Besides depth, the width and the cardinality are both essential to the performance of a DNN. Considering all these factors, the idea behind Inception is also adopted by ResNeXt: in each RM, the single pathway is replaced by a multi-branch topology, also known as the split-transform-merge structure, and the hyperparameters of every branch within a multi-branch RM are set to be identical. This modified RM is more powerful than the vanilla single-branch RM. Compared with complicated dense structures, the modified multi-branch RM achieves very close performance with much lower computational complexity.
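The split-transform-merge structure with cardinality C can be summarized as y = x + Σᵢ Tᵢ(x), where the C branches Tᵢ share an identical topology. A toy sketch, with scalar linear maps standing in for the real bottleneck branches:

```python
import numpy as np

def resnext_block(x, branches):
    # Split-transform-merge: apply C identical-topology branches to the
    # same input, sum their outputs, then add the residual shortcut.
    return x + sum(branch(x) for branch in branches)

# Toy example with cardinality C = 4; each "branch" is a scalar map
# standing in for the 1x1 -> 3x3 -> 1x1 bottleneck of a real branch.
x = np.array([1.0, -1.0])
branches = [lambda v, w=w: w * v for w in (0.1, 0.2, 0.3, 0.4)]
y = resnext_block(x, branches)
```

Because the branches share one topology, cardinality becomes a single extra hyperparameter, rather than the per-branch hand design that dense Inception-style modules require.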
After the inference of YOLO-v3, all possible fetal head regions within the candidate images are assumed to be located. Instead of relying on the classification of YOLO-v3, all detected regions are sent to a set of powerful networks specifically designed for object classification. The final prediction for each suspected image region, and hence for the corresponding original image, is made based on the stacked outputs of the classification networks.
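One simple way to combine the classifiers' outputs is to average their class-probability vectors and take the argmax; this sketch assumes that combination rule, which the text does not fully specify, so treat it as illustrative rather than the paper's exact scheme.

```python
import numpy as np

def stack_predictions(prob_list):
    # Average the per-class probability vectors from several classifiers
    # and return the index of the most probable class.
    return int(np.argmax(np.mean(prob_list, axis=0)))

# Three hypothetical classifiers disagree; the averaged probabilities
# decide the final label for the cropped region (and hence the image).
p1 = np.array([0.6, 0.4])   # e.g., a ResNet-style classifier
p2 = np.array([0.3, 0.7])   # e.g., a ResNeXt-style classifier
p3 = np.array([0.2, 0.8])   # e.g., an InceptionResNet-style classifier
label = stack_predictions([p1, p2, p3])
```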