Discussion
In this work, a hybrid framework is proposed for discriminating five types of standard fetal head view planes from background images in ultrasound. The contributions of this work can be summarized in the following four points. 1) To the best of our knowledge, it is the first work in which deep learning technology is utilized to identify TV, TT, and TC. 2) It is the first work in which YOLO-V3 is successfully applied to this topic. 3) The design of the proposed framework, which contains both an object detection network and an object classification network, with model stacking applied to further enhance the performance, is novel in this area. 4) The overall performance of the proposed framework is the state of the art.
Considering the combination of the object detection network and the object classification network, one question is intuitively raised: why are both networks needed to determine whether a given fetal ultrasound image is one of the five types of standard view planes or none of them? Firstly, an object detection network is a multi-task network that performs both object localization and object classification. However, the two tasks are optimized simultaneously, which can be more difficult than single-task optimization, and the trade-off between the two tasks can make the network less effective on a single task than a dedicated single-task network. Secondly, since ultrasound images are inherently noisy, it is necessary to extract the most important information, i.e., the most significant image content, so that the classification becomes more efficient. Therefore, both an object detection network and a classification network are deployed to form the proposed hybrid framework. To further verify the above, a set of experiments was designed in which both SonoNet64 and ResNet50 were trained with the original ultrasound images instead of the fetal head regions detected by YOLO. As reported in Table 3, the performance of both networks trained with the original images is much lower than that of the networks trained with the image content extracted by YOLO. It is also observed that YOLO itself is outperformed by both SonoNet and ResNet trained with the detected fetal region. In conclusion, the combination of an object detection network and a classification network is more effective than either type alone. Besides, since the classification network is employed, YOLO can focus on locating the possible fetal region in the candidate image. Therefore, the classification loss of YOLO is multiplied by a factor of 0.5 to emphasize the optimization of object localization during training. Inspection of the testing results confirms that YOLO successfully locates the region in which the fetal head is captured in each standard view plane image.
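To make the two-stage design concrete, the following is a minimal sketch of the detect-then-classify pipeline in PyTorch-style Python. The detector and classifier interfaces, the confidence threshold, and the 224x224 classifier input size are illustrative assumptions, not the published implementation.

import torch

def crop_fetal_head(image, box):
    # image: tensor of shape (C, H, W); box: (x1, y1, x2, y2) in pixels.
    x1, y1, x2, y2 = [int(v) for v in box]
    return image[:, y1:y2, x1:x2]

@torch.no_grad()
def classify_plane(image, detector, classifier, conf_thresh=0.5):
    # Stage 1: a YOLO-style detector proposes fetal head boxes with
    # confidence scores (assumed interface).
    boxes, scores = detector(image.unsqueeze(0))
    if len(boxes) == 0 or scores.max() < conf_thresh:
        return "background"  # no confident fetal head region found
    # Stage 2: crop the highest-scoring region and hand it to the
    # dedicated classifier (five standard planes plus background).
    crop = crop_fetal_head(image, boxes[scores.argmax()])
    crop = torch.nn.functional.interpolate(
        crop.unsqueeze(0), size=(224, 224),
        mode="bilinear", align_corners=False)  # assumed classifier input size
    logits = classifier(crop)
    return logits.argmax(dim=1).item()  # predicted class index

The down-weighting of YOLO's classification loss described above would then correspond to scaling that term during detector training, e.g., total_loss = loc_loss + 0.5 * cls_loss.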
As illustrated in Table 2, the performance of SonoNet64, the benchmark technique for identifying fetal ultrasound standard planes, which is inspired by VGG16, is competitive with the other three benchmark neural networks, all of which are extremely popular in computer vision. All of these CNN-based techniques are sufficiently powerful at feature extraction, and it is noticed that the complexity of a network is not always proportional to its performance; for example, InceptionResNet-V2 is outperformed by the other three. Therefore, to further improve upon the state of the art, model stacking is considered as one solution. The results shown in Table 3 confirm that the proposed stacking model outperforms the four benchmarks. The Receiver Operating Characteristic (ROC) curves shown in Figure 3 also indicate the effectiveness of the proposed stacking, as each AUC is close to 1 and the average AUC is 0.9893.
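As an illustration of the stacking step, the sketch below fits a logistic-regression meta-learner on the concatenated class probabilities produced by the base networks. This is one common realization of model stacking under stated assumptions; the meta-learner choice and the six-class probability layout (five planes plus background) are illustrative, not the paper's exact configuration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def stack_features(prob_list):
    # Concatenate each base model's (n_samples, n_classes) probabilities
    # into one meta-feature matrix of shape (n_samples, n_models * n_classes).
    return np.hstack(prob_list)

def fit_stacker(prob_list_train, y_train):
    # prob_list_train: probabilities from the four base networks on a
    # held-out stacking fold; y_train: integer plane labels for that fold.
    meta = LogisticRegression(max_iter=1000)
    return meta.fit(stack_features(prob_list_train), y_train)

def macro_auc(meta, prob_list_test, y_test):
    # Macro-averaged one-vs-rest AUC, analogous to the per-class ROC/AUC
    # analysis summarized by the average AUC reported above.
    probs = meta.predict_proba(stack_features(prob_list_test))
    return roc_auc_score(y_test, probs, multi_class="ovr", average="macro")

Training the meta-learner on out-of-fold predictions rather than on the base networks' own training outputs is the standard precaution against leaking the base models' overfitting into the stacker.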