The results of the separate models trained on P-x, LC-A, and LC-B are shown in Table 2. For all three sequences (i.e., T2, ADC, and hDWI), the AUCs of the three separate models are relatively high when tested within their own domains, but drop sharply when the models are tested directly on unseen domains. These results reveal a considerable cross-domain discrepancy (i.e., domain shift) among the four datasets. Note that, for the T2 sequence, the separate models of LC-A and LC-B achieve their highest testing AUCs (0.66 and 0.67) in the unseen domain LC-C, marginally higher than those (0.61) within their own domains. A potential reason for these biased predictions is the small number of testing samples (i.e., 29) in LC-C. The joint models in the table do not bring notable improvements in any sequence compared with the separate models; instead, they may even degrade performance owing to cross-site heterogeneity.
Given the severe discrepancies among our datasets, we sought to validate whether rigorous MR image preprocessing methods can improve the joint models' classification performance. Similar to scaled, whitening is another common preprocessing method, which normalizes the pixel values to zero mean and unit variance. The combined dataset of P-x and LC-A was taken as a representative case for evaluation. In Table 3, six preprocessing methods in total were adopted as in [35]: scaled, whitening, and each of these combined with bias field correction (BFC) or noise filtering (NF). The joint models using scaled and whitening served as the two baselines for comparison with the rigorous MR image preprocessing methods (i.e., BFC and NF). Figure 1 depicts image preprocessing examples for three methods (i.e., whitening, whitening + BFC, and whitening + NF). The left and right halves of each sample show the image before and after preprocessing, respectively. Before preprocessing, noticeable discrepancies in intensity distribution can be observed among the samples. The samples from LC-A contain more low-intensity grayscale pixels than the images of P-x. The jet color map was then used to highlight the intensity distributions of the two domains after preprocessing. All the color maps share the same intensity color scale. Similar intensity distributions can be found among the samples after preprocessing, demonstrating the effectiveness of the methods in harmonizing image intensity distributions.
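The two baseline normalizations can be sketched as follows. This is a minimal illustration, not the implementation used in the study: whitening follows the zero-mean, unit-variance definition given above, whereas the min-max definition of `scale` is an assumption, since the text does not define the scaled method. The function names and the `eps` parameter are hypothetical.

```python
import numpy as np


def scale(image):
    """Min-max scale intensities to [0, 1].

    Assumption: the 'scaled' method is taken to be min-max scaling;
    the source text does not define it.
    """
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Constant image: avoid division by zero.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def whiten(image, eps=1e-8):
    """Normalize intensities to zero mean and unit variance.

    eps (hypothetical parameter) guards against division by zero
    for constant images.
    """
    image = np.asarray(image, dtype=np.float64)
    return (image - image.mean()) / (image.std() + eps)
```

Both functions operate per image, so two scans acquired at different sites are mapped to a comparable intensity range regardless of their original grayscale statistics. Neither addresses spatially varying bias fields or noise, which is what BFC and NF are meant to handle.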
In Table 3, for the T2 sequence, BFC combined with either scaled or whitening outperforms the baselines. Moreover, BFC with whitening achieves the best AUCs of 0.91 and 0.80 on P-x and LC-A, respectively. However, these findings are not consistent with the results for ADC and hDWI. For ADC, the models preprocessed with BFC or NF underperform the baselines. Instead, the baseline models obtain the highest AUCs, with scaled alone and whitening alone reaching 0.73 and 0.72 on P-x and LC-A, respectively. For hDWI, both BFC and NF contribute only limited improvement over the baselines. On P-x, the AUC increases marginally from 0.73 (scaled only) to 0.80 (scaled with NF); on LC-A, an AUC of only 0.65 is achieved using scaled with BFC. The above results for the three sequences show that these preprocessing approaches can improve CM-Net's classification performance when combining our two datasets. However, none of the methods considerably boosts the joint models' generalization compared with the separate models of P-x and LC-A (in Table 2). This indicates that the preprocessing methods are probably insufficient to solve domain shift fundamentally. A possible reason is that the severe discrepancies stem from the inter-site differences (in Table 1), rather than from the intensity distributions of the heterogeneous mpMRI sequences alone (see details in Supplementary Figure 2).

2.3. Cross-domain Malignancy Classification and Lesion Detection

We emphasize the importance of knowledge transfer from a large-scale public dataset to a small-scale target domain. The malignancy estimation performance of CMD²A-Net (the architecture is shown in Figure 4 and described in detail in the Methods section) is evaluated. The P-x dataset is used only as a source domain. Either LC-A or LC-B is also set as a source domain for knowledge transfer between local cohorts. The scaled method was employed for image preprocessing. In general, the available types of MR sequences may vary among healthcare institutions. Thus, we employed ensemble learning to handle multiple sequences, allowing our framework to use either a single sequence or multiple sequences. Three common metrics were adopted to evaluate classification performance, i.e., AUC, sensitivity (SEN), and specificity (SPE).
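The evaluation described above can be illustrated with a short sketch. It rests on assumptions that the text does not confirm: the ensemble is taken to be a plain average of per-sequence malignancy probabilities over whichever sequences are available, and SEN and SPE are computed at a fixed threshold of 0.5. All function names are hypothetical. AUC is computed as the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half.

```python
import numpy as np


def ensemble_probability(probs_by_sequence):
    """Average malignancy probabilities over the available sequences.

    Assumption: simple averaging; the actual ensemble rule is not
    stated in the text. Keys are sequence names such as "T2" or "ADC".
    """
    stacked = np.stack(
        [np.asarray(p, dtype=np.float64) for p in probs_by_sequence.values()]
    )
    return stacked.mean(axis=0)


def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Return (SEN, SPE) at the given decision threshold (assumed 0.5)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob, dtype=np.float64) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """AUC as the fraction of correctly ranked positive-negative pairs."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=np.float64)
    pos, neg = y_prob[y_true], y_prob[~y_true]
    diff = pos[:, None] - neg[None, :]  # every positive vs. every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
```

Because `ensemble_probability` accepts any non-empty subset of sequences, the same evaluation code applies whether an institution provides one sequence or all three.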
Table 4. Malignancy classification results in the target domains for four source-target domain combinations.