IV. Experimental Results
To further evaluate the proposed framework, 19,928 2-D fetal ultrasound images collected by the Center of Prenatal Diagnosis of The First Hospital of Jilin University are used to form the experimental dataset. The images come from the historical screening records of 2,913 volunteers (cases) who underwent the standard ultrasound prenatal examination at The First Hospital of Jilin University between August 2017 and March 2018. The gestational age of the fetuses ranges from 18 to 31 weeks. Examinations were conducted by a group of experienced sonographers, and GE Voluson E8 ultrasound scanners were employed as the screening equipment. In this project, volunteers are anonymous and personal information has been removed from each image. In addition, the data collection was fully supervised by the ethics committee of the hospital.
All images are labelled into six categories, i.e., five types of fetal head standard view planes plus the background, by a group of experienced sonographers. The five standard planes are the Transventricular plane (TV), the Transthalamic plane (TT), the Transcerebellar plane (TB), the Coronal view of the nose (Nose), and the Coronal view of the eyes (Eyes). The background category contains other types of fetal ultrasound images, i.e., fetal head images that do not correspond to any of the standard views required by the screening, as well as abdominal views. The distribution of the images is illustrated in Figure 2. The images are divided into three portions with a ratio of 8:1:1 for training, validation, and testing, respectively. Since each case may contribute multiple images, any case included in the training or validation set is excluded from the testing set in order to avoid data leakage. In the training stage, both the training and validation sets are used, while the testing set is applied only for performance evaluation.
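To make the grouping constraint concrete, the following is a minimal sketch of such a case-level split, assuming scikit-learn; the record structure (image path, label, case identifier), the function name, and the random seed are illustrative and not taken from the original implementation.

\begin{verbatim}
# Case-level 8:1:1 split so that no volunteer (case) appears in more than
# one partition. Illustrative sketch: `records` is a hypothetical list of
# (image_path, label, case_id) tuples.
from sklearn.model_selection import GroupShuffleSplit

def split_by_case(records, seed=0):
    cases = [r[2] for r in records]

    # First hold out 10% of the cases for testing.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    trainval_idx, test_idx = next(outer.split(records, groups=cases))

    trainval = [records[i] for i in trainval_idx]
    trainval_cases = [r[2] for r in trainval]

    # Then split the remaining 90% into 8:1 (i.e. 1/9 of it for validation).
    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 9, random_state=seed)
    train_idx, val_idx = next(inner.split(trainval, groups=trainval_cases))

    train = [trainval[i] for i in train_idx]
    val = [trainval[i] for i in val_idx]
    test = [records[i] for i in test_idx]
    return train, val, test
\end{verbatim}

Splitting on case identifiers rather than on individual images guarantees that all images from a given volunteer fall into exactly one partition, so the image-level ratio is only approximately 8:1:1.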
To train the YOLO detector, each image is fed into the network together with its label and the four coordinates of its fetal head region, i.e., the top-left and bottom-right corners of the bounding box. Only images containing a fetal head region are used for training, regardless of whether they are standard head planes or background, since the number of images without a fetal head region is very limited and the fetal body shown in such background images is not of interest in this work. The YOLO architecture used in this work is designed for inputs of size 320 × 320 (width × height); therefore, each image is resized to 320 × 320 before being fed into YOLO, and the training batch size is set to 16 images. The performance of YOLO is evaluated on the testing set and reported in Table 1, whose last row gives the precision, recall, and F1-score averaged over the six categories.
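Because the detector works on fixed-size inputs, each image and its head bounding box have to be rescaled consistently before training. The snippet below is a minimal sketch of this preprocessing step, assuming Pillow and NumPy; the function name and the normalization of pixel values to [0, 1] are illustrative assumptions rather than details given in the text.

\begin{verbatim}
# Resize an image to the 320 x 320 YOLO input size and rescale the fetal-head
# bounding box (given as top-left / bottom-right corners) accordingly.
# Illustrative sketch; assumes Pillow and NumPy.
import numpy as np
from PIL import Image

YOLO_SIZE = (320, 320)  # (width, height)

def prepare_yolo_sample(image_path, box):
    """box = (x1, y1, x2, y2): top-left and bottom-right corners in pixels."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size

    resized = image.resize(YOLO_SIZE, Image.BILINEAR)

    # Scale the corner coordinates by the same factors as the image.
    sx = YOLO_SIZE[0] / w
    sy = YOLO_SIZE[1] / h
    x1, y1, x2, y2 = box
    scaled_box = (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

    # Normalize pixel values to [0, 1] for the network input (assumption).
    tensor = np.asarray(resized, dtype=np.float32) / 255.0
    return tensor, scaled_box
\end{verbatim}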
As illustrated in Figure 1, the local regions in which fetal heads are detected are cropped from the original images, as in the first two examples, where the head regions are found by YOLO. For background images in which no fetal head is detected, such as the third sample in Figure 1, a pre-set rectangular window is applied to extract the main image content, so that fetal head images possibly missed by YOLO are not excluded from the subsequent classification. The new samples are resized to 224 × 224 for ResNet50, ResNeXt50, and SonoNet64, and to 299 × 299 for InceptionResNet-V2. The training settings of these networks are similar. Transfer learning is conducted from models pre-trained on ImageNet. Adam is employed as the optimizer with an initial learning rate of 0.001, and the learning rate is reduced by a factor of 0.1 if the validation loss does not improve over the past 10 epochs. Cross-entropy is used as the loss function, and for each type of classification network the model with the smallest validation loss is selected once training is complete. In this work, the categorical predicted probabilities of each sample generated by each model, rather than the predicted labels, are used for the subsequent model stacking; therefore, 24 probabilities (6 classes × 4 models) are obtained for each sample. The model stacking is then conducted according to the following equation,
\begin{equation}
  {\tilde{Y}}_{i}=\operatorname{argmax}_{c}\left(P_{i,c}\right), \nonumber
\end{equation}
\begin{equation}
  P_{i,c}=\frac{1}{m}\sum_{k=1}^{m} p_{i,k,c}, \nonumber
\end{equation}
where \(p_{i,k,c}\) is the probability that sample \(i\) belongs to class \(c\) as predicted by model \(k\), \(m\) is the number of individual models, \(P_{i,c}\) is the probability of sample \(i\) belonging to class \(c\) averaged over the \(m\) models, and \({\tilde{Y}}_{i}\) is the final predicted label of sample \(i\). The performance of each individual network, as well as that of the stacking model, is evaluated on the testing set. The precision, recall, F1-score, and accuracy averaged over the six classes are employed as the evaluation metrics. As reported in Table 2, the proposed stacking model outperforms each individual network.
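As a concrete illustration, the stacking rule above amounts to a categorical mean over the model outputs followed by an argmax over classes. The sketch below assumes the per-model softmax probabilities have already been collected into a NumPy array of shape (samples, models, classes), e.g. (N, 4, 6) here; this layout is an implementation choice, not something specified in the text.

\begin{verbatim}
# Probability-averaging model stacking: average the per-class probabilities
# over the m individual models, then take the argmax over classes.
# Illustrative sketch; `probs` has shape (n_samples, n_models, n_classes).
import numpy as np

def stack_predictions(probs: np.ndarray) -> np.ndarray:
    avg = probs.mean(axis=1)   # P_{i,c}: categorical mean over the m models
    return avg.argmax(axis=1)  # tilde{Y}_i: final predicted label per sample

# Example usage: three samples, four models, six classes.
# probs = np.random.dirichlet(np.ones(6), size=(3, 4))
# labels = stack_predictions(probs)
\end{verbatim}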
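For completeness, the fine-tuning recipe described for the individual classifiers (ImageNet initialization, Adam with an initial learning rate of 0.001, learning-rate reduction by a factor of 0.1 after 10 epochs without validation improvement, cross-entropy loss, and selection of the checkpoint with the smallest validation loss) can be sketched as follows, assuming PyTorch and torchvision. ResNet50 is used as the example backbone, and the data loaders, checkpoint path, and epoch budget are illustrative placeholders.

\begin{verbatim}
# Minimal fine-tuning sketch for one classifier (ResNet50 as an example).
# Assumes PyTorch/torchvision; train_loader and val_loader are placeholders.
import torch
import torch.nn as nn
from torchvision import models

def finetune_resnet50(train_loader, val_loader, num_classes=6, epochs=100,
                      device="cuda" if torch.cuda.is_available() else "cpu",
                      ckpt_path="best_resnet50.pt"):
    # ImageNet-pretrained backbone with a new 6-way classification head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10)

    best_val = float("inf")
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # Validation loss drives both the LR schedule and model selection.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x), y).item() * x.size(0)
                n += x.size(0)
        val_loss /= n
        scheduler.step(val_loss)

        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), ckpt_path)

    return ckpt_path
\end{verbatim}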