IV. Experimental Results
To further evaluate the proposed framework, 19,928 2-D fetal ultrasound
images collected by the Center of Prenatal Diagnosis of The First
Hospital of Jilin University are used to form the experimental dataset.
The images come from the historical screening records of 2,913
volunteers (cases) who underwent the standard prenatal ultrasound
examination at The First Hospital of Jilin University from August 2017
to March 2018. The gestational age of the fetuses ranges from 18 weeks
to 31 weeks. Examinations are conducted by a group of experienced
sonographers, and GE Voluson E8 ultrasound scanners are employed as the
screening equipment. In this project, volunteers are anonymous and
personal information has been removed from each image. In addition, the
data collection is fully supervised by the ethics committee of the
hospital.
All images are labelled by a group of experienced sonographers into six
categories, i.e., five types of fetal head standard view planes and the
background. The standard planes are the Transventricular plane (TV),
Transthalamic plane (TT), Transcerebellar plane (TB), Coronal view of
nose (Nose), and Coronal view of eyes (Eyes). The background category
includes other types of fetal ultrasound images, i.e., fetal head images
that are not the standard views required by the screening, as well as
abdominal views. The distribution of the images is illustrated in
Figure 2. The images are divided into three portions with a ratio of
8:1:1 for training, validation, and testing, respectively. Since each
case may have multiple images, any case included in the training or
validation set is excluded from the testing set in order to avoid data
leakage. In the training stage, both the training and validation sets
are utilized, while the testing set is applied only for performance
evaluation.
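As a concrete illustration, the following is a minimal sketch of such a case-wise split using scikit-learn's GroupShuffleSplit; the variable names (image_paths, labels, case_ids) are illustrative, not components of the original pipeline.

```python
# Minimal sketch of a case-wise ~8:1:1 split: images from the same case
# never end up in different subsets (hypothetical variable names).
from sklearn.model_selection import GroupShuffleSplit

def case_wise_split(image_paths, labels, case_ids, seed=0):
    # Hold out ~10% of the cases (and all of their images) as the test set.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    trainval_idx, test_idx = next(outer.split(image_paths, labels, groups=case_ids))

    # Split the remaining cases into training (8/9) and validation (1/9).
    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 9, random_state=seed)
    rel_train, rel_val = next(inner.split(
        [image_paths[i] for i in trainval_idx],
        [labels[i] for i in trainval_idx],
        groups=[case_ids[i] for i in trainval_idx],
    ))
    train_idx = [trainval_idx[i] for i in rel_train]
    val_idx = [trainval_idx[i] for i in rel_val]
    return train_idx, val_idx, test_idx
```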
To train YOLO, each image is fed into the network together with its
label and the four coordinates of its fetal head region, i.e., the
top-left and bottom-right corners. Only images containing a fetal head
region are used for training, whether they are standard head planes or
background, since the number of images without a fetal head region is
very limited and the fetal body shown in such background images is not
of interest in this work. The YOLO architecture used in this work is
designed for images of size 320 × 320 (width × height), so each image is
resized to 320 × 320 before being input to YOLO, and the batch size for
training is set to 16 images. The performance of YOLO is evaluated on
the testing set, and the results are reported in Table 1. The last row
gives the average values of precision, recall, and F1-score over the six
categories.
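For illustration, the following is a minimal sketch of preparing one YOLO training sample; OpenCV is assumed, and the helper name and its interface are hypothetical.

```python
# Sketch: resize an image to 320 x 320 and rescale the fetal head box
# (top-left and bottom-right corners) to the new coordinate system.
import cv2

def prepare_yolo_sample(image, box, size=320):
    h, w = image.shape[:2]
    resized = cv2.resize(image, (size, size))   # dsize is (width, height)
    sx, sy = size / w, size / h                 # per-axis scale factors
    x1, y1, x2, y2 = box
    scaled_box = (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
    return resized, scaled_box
```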
As illustrated in Figure 1, the local regions where fetal heads are
detected are cropped from the original images, such as the first two
images in which the head regions are found by YOLO. For background
images in which no fetal head is shown, such as the third sample in
Figure 1, a pre-set rectangular window is applied to extract the main
image content. This fallback avoids discarding fetal head images that
YOLO may have missed. The new samples are resized to 224 × 224 for
ResNet50, ResNeXt50, and SonoNet64, and to 299 × 299 for
InceptionResNet-V2.
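A minimal sketch of this cropping step is shown below; OpenCV is assumed, and the fallback window fraction is an illustrative assumption rather than a value from the paper.

```python
# Sketch: crop the detected head region when available, otherwise fall
# back to a pre-set central window, then resize for the classifier.
import cv2

def crop_for_classifier(image, detection=None, out_size=224, fallback_frac=0.8):
    h, w = image.shape[:2]
    if detection is not None:
        x1, y1, x2, y2 = [int(v) for v in detection]
    else:
        # Pre-set rectangular window covering the central part of the image.
        dx = int(w * (1 - fallback_frac) / 2)
        dy = int(h * (1 - fallback_frac) / 2)
        x1, y1, x2, y2 = dx, dy, w - dx, h - dy
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))
```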
The training settings for these networks are similar. Transfer learning
is conducted based on the models pre-trained on ImageNet. Adam is
employed as the optimizer, with the initial learning rate set to 0.001.
The learning rate is reduced by a factor of 0.1 if the validation loss
has not improved during the past 10 epochs. Cross-entropy is applied as
the loss function, and for each type of classification network the model
with the smallest validation loss is selected once training is complete.
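For concreteness, a minimal sketch of this fine-tuning setup in PyTorch is given below; the train_one_epoch and evaluate helpers, the checkpoint path, and the epoch count are placeholders, not part of the original implementation.

```python
# Sketch of the fine-tuning loop: Adam (lr = 0.001), learning rate reduced
# by 0.1 after 10 epochs without validation-loss improvement, cross-entropy
# loss, and the checkpoint with the smallest validation loss retained.
import torch

def fine_tune(model, train_loader, val_loader, num_epochs=100):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10)

    best_val_loss = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, criterion, optimizer)  # placeholder helper
        val_loss = evaluate(model, val_loader, criterion)           # placeholder helper
        scheduler.step(val_loss)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pt")
```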
In this work, the per-class predicted probabilities generated by each
model for each sample are utilized for the subsequent model stacking
instead of the predicted labels. Therefore, 24 probabilities (6 classes
× 4 models) are obtained for each sample.
Then, the model stacking is conducted based on the following equations,
\begin{equation}
{\tilde{Y}}_{i}=\operatorname{argmax}_{c}\left(P_{i,c}\right),\nonumber
\end{equation}
\begin{equation}
P_{i,c}=\frac{1}{M}\sum_{m=1}^{M}p_{i,m,c},\nonumber
\end{equation}
where \(p_{i,m,c}\) is the probability that sample \(i\) belongs to
class \(c\) as predicted by model \(m\), \(M\) is the number of models,
\(P_{i,c}\) is the probability of class \(c\) averaged over the \(M\)
models, and \({\tilde{Y}}_{i}\) is the final predicted label of
sample \(i\).
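This stacking rule can be expressed compactly with NumPy, as in the following sketch; the array layout is an assumption made for illustration.

```python
# Sketch of the stacking rule: average the per-model probabilities per
# class, then take the argmax over classes.
import numpy as np

def stack_predictions(probs):
    """probs: array of shape (M, N, C) with M models, N samples, C classes."""
    P = probs.mean(axis=0)     # class-wise averaged probabilities, shape (N, C)
    return P.argmax(axis=1)    # final predicted label for each sample
```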
The performance of each individual network, as well as that of the
stacking model, is evaluated on the testing set. The precision, recall,
F1-score, and accuracy averaged over the six classes are employed as the
evaluation metrics. As reported in Table 2, the proposed stacking model
outperforms each individual network.
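A minimal sketch of computing these macro-averaged metrics with scikit-learn is shown below; the function name is illustrative.

```python
# Sketch: macro-averaged precision, recall, F1-score, and overall accuracy.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def macro_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")
    return precision, recall, f1, accuracy_score(y_true, y_pred)
```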