Discussion
In this work, a hybrid framework is proposed for discriminating five
types of standard fetal head view planes in ultrasound images from
background images. The contributions of this work can be summarized in
the following four points. 1) To the best of our knowledge, it is the
first work in which deep learning is utilized to identify the TV, TT,
and TC planes. 2) It is the first work in which YOLO-V3 is successfully
applied to this task. 3) The design of the proposed framework, which
combines an object detection network with an object classification
network and applies model stacking to further enhance performance, is
novel in this area. 4) The overall performance of the proposed framework
is state-of-the-art.
Considering the combination of an object detection network and an object
classification network, one question arises naturally: why are both
networks needed to determine whether a given fetal ultrasound image
belongs to one of the five types of standard view planes or to none of
them? Firstly, an object detection network is a multi-task network that
performs both object localization and object classification. However,
the two tasks are optimized simultaneously, which can be more difficult
than optimizing a single task, and the trade-off between the two tasks
can make the network less effective on either task than a dedicated
single-task network. Secondly, since ultrasound images are inherently
noisy, it is necessary to extract the most important information, namely
the most significant image content, from the ultrasound image so as to
make the classification more efficient. Therefore, both an object
detection network and a classification network are deployed to form the
proposed hybrid framework. To further verify this reasoning, a set of
experiments was designed in which both SonoNet64 and ResNet50 were
trained with the original ultrasound images instead of the fetal head
regions detected by YOLO. As reported in Table 3, the performance of
both networks trained with the original images is much lower than that
of the networks trained with the image content extracted by YOLO. It is
also observed that YOLO itself is outperformed by both SonoNet and
ResNet when they are trained with the detected fetal region. In
conclusion, the combination of an object detection network and a
classification network is more effective than either type alone.
Moreover, since a dedicated classification network is employed, YOLO can
focus on locating the possible fetal region in the candidate image.
Therefore, the classification loss of YOLO is multiplied by a factor of
0.5 so as to emphasize the optimization of object localization during
training. The testing results show that the region in which the fetal
head is captured in each standard view plane image can be located by
YOLO successfully.
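The two-stage inference described above, together with the down-weighted
classification term in the detector's training loss, can be sketched as
follows. This is a minimal illustration, not the authors' implementation:
the box format `(x, y, w, h)` and the `detector`/`classifier` callables are
assumptions standing in for the trained YOLO-V3 and the SonoNet64/ResNet50
classifiers.

```python
import numpy as np

def crop_fetal_head(image, box):
    """Crop the fetal-head region from the image.

    `box` is assumed to be (x, y, w, h) in pixels, as a detector such as
    the paper's YOLO-V3 might supply.
    """
    x, y, w, h = box
    return image[y:y + h, x:x + w]

def hybrid_predict(image, detector, classifier):
    """Two-stage inference: localize the head region, then classify the crop.

    `detector` and `classifier` are hypothetical stand-ins for the trained
    detection and classification networks.
    """
    box = detector(image)                  # localization only
    region = crop_fetal_head(image, box)   # keep the significant content
    return classifier(region)              # final view-plane label

def yolo_total_loss(loc_loss, conf_loss, cls_loss, cls_weight=0.5):
    """Combine detector loss terms, down-weighting classification by 0.5
    so that optimization emphasizes localization, as described in the text.
    The three-term decomposition is a simplification for illustration."""
    return loc_loss + conf_loss + cls_weight * cls_loss
```

In this arrangement the detector only needs a reliable bounding box; the
harder fine-grained discrimination between view planes is delegated to the
single-task classifier operating on the cropped region.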
As illustrated in Table 2, the performance of SonoNet64, the benchmark
technique applied to identify fetal ultrasound standard planes, which is
inspired by VGG16, is competitive with the other three benchmark neural
networks, all of which are extremely popular in computer vision. All of
these CNN-based techniques are powerful feature extractors, yet it is
noticed that network complexity is not always proportional to
performance; for example, InceptionResNet-V2 is outperformed by the
other three. Therefore, to further improve on the state of the art,
model stacking is considered as one possible solution. Reviewing the
results shown in Table 3, it is observed that the proposed stacking
model outperforms all four benchmarks. The Receiver Operating
Characteristic (ROC) curves shown in Figure 3 also indicate the
effectiveness of the proposed stacking: each AUC is close to 1, and the
average AUC is 0.9893.
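The idea behind model stacking can be sketched as follows: the class
probabilities produced by the base CNNs become the input features of a
meta-model trained on held-out predictions. The least-squares linear
meta-layer below is purely illustrative, since this excerpt does not
specify which meta-learner the framework uses.

```python
import numpy as np

def fit_stacker(base_probs, labels, n_classes):
    """Fit a linear meta-model on the concatenated class-probability
    outputs of several base models.

    base_probs: list of (n_samples, n_classes) arrays, one per base CNN.
    A least-squares linear layer stands in for the (unspecified)
    meta-learner; this is an illustrative sketch only.
    """
    X = np.hstack(base_probs)               # (n_samples, k * n_classes)
    Y = np.eye(n_classes)[labels]           # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def stack_predict(base_probs, W):
    """Combine base-model probabilities with the learned weights and
    take the arg-max class."""
    return np.argmax(np.hstack(base_probs) @ W, axis=1)
```

Because the meta-model sees all base models' confidence profiles at once,
it can learn to lean on whichever base network is most reliable for each
class, which is how stacking can exceed every individual benchmark.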