2. Materials and Methods
2.1. Data set and preprocessing. The image data of urinary particles were obtained from the urine samples of 384 patients at the Shenzhen Sixth People's Hospital of Guangdong Province. Informed consent was obtained from each patient before image collection, and the study was approved by the Ethics Review Committee of Shenzhen University. The database was established by a simple procedure. An appropriate amount of urine sample was placed in the U-shaped area of the urine smear device (Figure 1a). The U-shaped area containing the urine sample was then placed into the microscopic imaging system (Figure 1b) for data acquisition. When the magnification of the microscope was too low, the cells in the image were too small for effective network training and object detection. When the magnification was too high, the number of cells in a single image was too small for efficient establishment of the database. We therefore chose a 40× objective lens for data collection. The acquired images (Figure 1c) are RGB three-channel color images with a resolution of 1536 × 1024; 20 cell-morphology images were randomly acquired from each sample under the 40× objective lens.
We invited three clinically experienced experts to label the cells in the morphological images with LabelImg, a labeling tool commonly used in deep learning, as shown in Figure 1d. Image data for 15 different cell types were obtained. We randomly divided the data into training and test sets at a 7:3 ratio of image counts. To improve the robustness of the network model and its generalization to different test images, we augmented the image data. The augmentation methods used were geometric transformation, adding noise, and changing contrast and brightness. After data augmentation, the data were ready for network training.
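As an illustration, the following is a minimal sketch of such an augmentation pipeline using OpenCV and NumPy; the probability thresholds and parameter ranges are our own assumptions, not the exact settings used in the experiments.

```python
import cv2
import numpy as np

def augment(image):
    """Apply the three augmentation families described in the text:
    geometric transformation, additive noise, and contrast/brightness change.
    All parameter ranges below are illustrative assumptions."""
    # Geometric transformation: random horizontal/vertical flip
    if np.random.rand() < 0.5:
        image = cv2.flip(image, flipCode=int(np.random.choice([-1, 0, 1])))
    # Additive Gaussian noise
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # Contrast (alpha) and brightness (beta) jitter
    alpha = np.random.uniform(0.8, 1.2)
    beta = np.random.uniform(-20, 20)
    return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
```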
2.2. Network model. The RetinaNet network [22] includes one backbone network and two sub-networks. The structure diagram of the network model is shown in Figure 2. The backbone network comprises the ResNet and FPN modules [24], which are responsible for feature extraction of the cells and generate feature maps at many different scales. The sub-networks comprise a classification sub-network and a regression sub-network, responsible for the classification and localization of objects, respectively.
The FPN network is the core module of the network, and its structure is shown in Figure 3. It mainly comprises two pathways: bottom-up and top-down. In the bottom-up pathway, the spatial dimensions of the feature maps are gradually halved as the network deepens. In the top-down pathway, the output of the corresponding convolutional stage is passed through a 1 × 1 convolution filter and then added to the up-sampled feature map from the level above (except at the top level, which has no higher level). Finally, the feature map of each layer is obtained by a 3 × 3 convolution. By combining feature information from multiple layers, the FPN enables the network to better handle small objects such as cells.
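The following is a minimal sketch of one top-down merge step described above, written with the Keras functional API (the paper reports using Keras); the channel width and function name are illustrative assumptions.

```python
from tensorflow.keras import layers

def fpn_merge(top_down, lateral, out_channels=256):
    """One top-down merge step of the FPN:
    1x1 lateral convolution, upsample-and-add, then 3x3 smoothing."""
    # 1x1 convolution on the bottom-up feature map (lateral connection)
    lateral = layers.Conv2D(out_channels, kernel_size=1, padding="same")(lateral)
    # Up-sample the coarser top-down map by 2 and add it element-wise
    top_down = layers.UpSampling2D(size=2)(top_down)
    merged = layers.Add()([top_down, lateral])
    # 3x3 convolution producing the final feature map for this level
    return layers.Conv2D(out_channels, kernel_size=3, padding="same")(merged)
```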
In the classification sub-network, each level of the feature pyramid output by the FPN is passed through four 3 × 3 convolutional layers, each followed by a ReLU activation function, and then into one 3 × 3 convolutional layer with K × A convolution kernels, where K is the number of categories and A is the number of anchors per position (in the experiment, K = 15 and A = 9). Finally, a sigmoid activation function produces the category output. The regression sub-network has the same structure as the classification sub-network, but the two use separate parameters.
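As a sketch, the classification sub-network described above can be written in Keras as follows; the channel width of 256 follows the original RetinaNet design and is an assumption here.

```python
from tensorflow.keras import Sequential, layers

K, A = 15, 9  # number of classes and anchors per position (from the text)

def build_class_subnet(channels=256):
    """Classification sub-network: four 3x3 conv + ReLU layers,
    then a 3x3 conv with K*A kernels and a sigmoid output."""
    subnet = Sequential()
    for _ in range(4):
        subnet.add(layers.Conv2D(channels, kernel_size=3, padding="same",
                                 activation="relu"))
    subnet.add(layers.Conv2D(K * A, kernel_size=3, padding="same",
                             activation="sigmoid"))
    return subnet  # applied to every level of the feature pyramid
```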
Notably, to address the imbalance between the background class (no object) and the foreground class (containing an object) in one-stage object detection algorithms, the network introduces an optimized loss function, the focal loss, shown in Equation 1. This loss function down-weights the easily classified samples (mostly the background class), thereby improving the detection performance of the model.
\(\text{FL}(p_{t})=-\alpha_{t}(1-p_{t})^{\gamma}\log(p_{t})\) (1)
where \(p_{t}\) denotes the model's estimated probability for the ground-truth class, \(\alpha_{t}\) is a weighting factor between 0 and 1, and \(\gamma\) is a modulation factor that controls the rate at which easily classified samples are down-weighted. In the experiment, \(\alpha_{t}\) was 0.25 and \(\gamma\) was 2.0.
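A minimal TensorFlow/Keras sketch of Equation 1 is shown below, assuming binary (per-anchor, per-class) targets as in RetinaNet; the clipping constant is our own numerical-stability assumption.

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Focal loss of Equation 1: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    # p_t is the predicted probability of the ground-truth class
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    # alpha_t weights foreground (alpha) and background (1 - alpha) differently
    alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
    # Clip to avoid log(0); the bound 1e-7 is an illustrative choice
    p_t = tf.clip_by_value(p_t, 1e-7, 1.0)
    return -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
```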
2.3. Model training. After network construction was completed, we normalized the image data of the training set containing 15 types of urine cells and then fed them into the network model in batches for training. The network parameters were initialized with Gaussian weights with a standard deviation of 0.01 and biases of 0. For the training parameters, we set the momentum to 0.9, the weight decay to 0.0005, and the learning rate to 1e−4, and used a batch size of eight images. The model was optimized with Adam [25]. In the experiment, we used the Keras deep learning framework to train the network on a 64-bit Ubuntu 16.04.5 system. The deep learning server had the following configuration: an Nvidia 1080 GPU, an i7-6600 CPU, and 16 GB of memory. We iterated the entire model over the training set and, after each epoch, evaluated the model parameters on the validation set, saving the parameters of the best model.
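The training setup described above can be sketched in Keras as follows; `retinanet_model`, the data arrays, the epoch count, and the monitored metric are placeholders we assume for illustration, and the loss is simplified to the classification focal loss alone.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

# Hyperparameters from the text: learning rate 1e-4, batch size 8
optimizer = Adam(learning_rate=1e-4)
# retinanet_model is a placeholder for the assembled network of Section 2.2
retinanet_model.compile(optimizer=optimizer, loss=focal_loss)

# Save the best parameters after each epoch, as described in the text
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss",
                             save_best_only=True)

retinanet_model.fit(x_train, y_train,
                    batch_size=8,
                    epochs=100,           # illustrative epoch count
                    validation_data=(x_val, y_val),
                    callbacks=[checkpoint])
```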
2.4. Model evaluation method. We used the mean average precision (mAP), commonly used in deep learning, to evaluate the performance of the model. In addition, the time taken by the computer to process a single image was considered as an evaluation indicator. For a given cell type in an image (for example, crystals, denoted by the letter C below), let x be the number of C instances correctly detected by the model and y the total number of C instances; the accuracy of category C in this image can then be expressed as P, and the average accuracy over n images as AP. In this way, we calculated the accuracy rates of the 15 types of urine cells (AP1, AP2, ..., AP15). mAP is the average of the accuracy rates of all 15 types. These quantities are calculated as follows.
\(P=x/y\) (2)
\(AP=\left(\sum_{i=1}^{n}P_{i}\right)/n\) (3)
\(mAP=\left(\sum_{j=1}^{m}AP_{j}\right)/m\) (4)
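A NumPy sketch of Equations 2 to 4 follows; the nested-list input layout is an assumption made for illustration.

```python
import numpy as np

def mean_average_precision(correct, total):
    """Compute mAP from Equations 2-4.

    correct[c][i]: number of correctly detected cells of class c in image i (x)
    total[c][i]:   total number of cells of class c in image i (y)
    """
    ap_per_class = []
    for c in range(len(correct)):          # m classes (m = 15 in the text)
        # Equation 2: per-image accuracy P = x / y (skip images without class c)
        p = [x / y for x, y in zip(correct[c], total[c]) if y > 0]
        # Equation 3: AP is the mean of P over the n images containing class c
        ap_per_class.append(np.mean(p))
    # Equation 4: mAP is the mean of the per-class APs
    return float(np.mean(ap_per_class))
```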