Insert Figure 4.
One of AI’s significant benefits is its ability to scale intelligence at an unprecedented pace. In the time it takes a clinician to diagnose a single patient, an AI system could, at least in theory, analyze an unlimited number of patients. However, the same scalability holds for mistakes and faulty diagnoses, and validation is of the utmost importance to guard against poor generalizability of AI models. AI models, especially high-dimensional ones, tend to ‘overfit’ the training data, resulting in a model that appears to work well on the training population but predicts poorly for future or external patients. One example is IBM’s Watson, which recommended unsafe cancer treatments because it was trained on a sample size too limited for its dimensionality. For models to be more broadly applicable and generalizable to other populations, diligent validation and replication in external datasets are paramount. Unfortunately, validation is often insufficient and external replication altogether missing. Even FDA-approved AI applications fall short in this domain: only 11/118 FDA applications (up until 2021) reported a validation set of more than 1,000 samples, and only 19/118 reported a multi-reader, multicenter validation study. Site-specific recalibration or retraining on multiple datasets are possible solutions to adapt a model to another context, although caution is required to avoid learning spurious patterns.
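To make the overfitting risk concrete, the minimal sketch below (using scikit-learn on purely hypothetical, signal-free data) fits an essentially unregularized logistic regression to a high-dimensional dataset: the apparent performance on the development data is near-perfect, whereas performance on an ‘external’ cohort drawn from the same noise process drops to chance, which is exactly what diligent external validation is meant to expose.

```python
# Hedged illustration only: hypothetical noise data, no clinical content.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# High-dimensional toy cohort: 200 patients, 500 features with no true signal
X_dev = rng.normal(size=(200, 500))
y_dev = rng.integers(0, 2, size=200)

# Very weak regularization (large C) mimics an unconstrained high-dimensional fit
model = LogisticRegression(C=1e6, max_iter=5000).fit(X_dev, y_dev)
auc_apparent = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])

# 'External' cohort drawn from the same noise-generating process
X_ext = rng.normal(size=(200, 500))
y_ext = rng.integers(0, 2, size=200)
auc_external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"Apparent AUC (development data): {auc_apparent:.2f}")  # close to 1.0
print(f"External AUC (new cohort):       {auc_external:.2f}")  # close to 0.5
```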
Randomized controlled trials (RCTs) and prospective validations are scarce in medical AI80,81. Most applications are only tested on retrospective data and have not passed prospective validation in an independent dataset. In the past two years, new guidelines have emerged for reporting and evaluating RCTs with an AI intervention component, such as the CONSORT-AI and SPIRIT-AI standards. A systematic review from 2022 reported that none of the 41 assessed RCTs adhered to these standards and suggested that AI applications with FDA approval do not always prove efficacy. Thus, clinical utility and safety remain uncertain, providing a clear direction for future research before AI can be confidently implemented in clinical medicine.

Ethical considerations

AI systems often rely on and are trained on confidential personal data, such as health records, imaging, or genomic data. The more voluminous these data become, especially as multiple data sources are integrated and new ones are unlocked, the more critical privacy becomes. The EU’s General Data Protection Regulation (GDPR) already provides a ‘right to explanation’ when decisions are based on “automated processing” such as AI. There is a complicated relationship between privacy and trust. If the mechanisms of algorithms remain hidden for privacy reasons, this could also impede trust in the solution and slow down adoption by patients and clinicians. Furthermore, being overprotective of privacy in data collection, usage, and sharing can hinder the potential patient benefits of using these data to drive AI solutions for novel diagnostic or therapeutic options. Novel approaches are emerging that preserve privacy without slowing down innovation, such as the generation of synthetic data. Rather than (pseudo)anonymizing samples, AI-generated synthetic data samples can be used for safe data sharing or even new model development.
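As a simplified illustration of the synthetic-data idea, the sketch below fits a generative model to a hypothetical tabular cohort and releases samples drawn from the fitted model rather than the original records. The variables, the cohort, and the choice of a Gaussian mixture as generator are illustrative assumptions; real synthetic-data pipelines use richer generative models and add formal privacy and fidelity checks.

```python
# Hedged illustration: hypothetical cohort, toy generator, no privacy guarantees.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical cohort: age (years), systolic blood pressure (mmHg), LDL (mmol/L)
real = pd.DataFrame({
    "age": rng.normal(62, 10, 1000),
    "sbp": rng.normal(135, 18, 1000),
    "ldl": rng.normal(3.4, 0.9, 1000),
})

# Fit a generative model to the real records ...
generator = GaussianMixture(n_components=5, random_state=0).fit(real)

# ... and share samples drawn from it instead of the real records
synthetic_values, _ = generator.sample(n_samples=1000)
synthetic = pd.DataFrame(synthetic_values, columns=real.columns)

# The synthetic table mimics the joint distribution without copying any patient
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```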
While AI systems are not moral agents, their decisions can have ethical consequences. Bias and fairness, in particular, are two key concepts in this context, and various cases of biases embedded in developed models exist. The 2021 AI action plan from the FDA warns that biases in healthcare systems, such as racial or gender biases, can be inadvertently introduced into algorithms. This leads to research conclusions and applications biased toward specific populations while overlooking others. If left uncorrected, such biases could be further reinforced, exacerbating the health inequalities experienced by underrepresented populations by excluding them from AI-driven medical innovations. Therefore, researchers need to ensure that the training sample is diverse and represents any future population to which the AI model will be applied.
While the above risks are important, it is essential to realize that humans are not free from implicit biases either. For instance, cardiologists are trained to recognize symptoms of coronary artery disease more frequently in men, resulting in underdiagnosis in women. The advantage of data and algorithms is that biases may be detected, corrected, or even prevented. From the study’s outset, during the data collection phase, investigators should strive for a representative training dataset that resembles the data distribution the algorithm will encounter once deployed. For the stage before model development, guidelines have been defined to assess the risk of algorithmic bias, such as the PROBAST tool. Likewise, new techniques for the modeling phase are emerging that can help mitigate bias, such as adversarial debiasing. Lastly, dedicated tools have been developed to evaluate the fairness of algorithms along a variety of fairness definitions, such as the open-source Python library AI Fairness 360.
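To indicate what such a fairness audit can look like in practice, the minimal sketch below computes two common group-fairness metrics with AI Fairness 360 (aif360) on deliberately biased, hypothetical predictions. The column names, the toy data, and the choice of ‘sex’ as the protected attribute are illustrative assumptions, and the class and method names reflect the aif360 API as we understand it.

```python
# Hedged illustration: hypothetical predictions, toy protected attribute.
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

rng = np.random.default_rng(2)

# Hypothetical model output: 'sex' is the protected attribute (1 = male),
# 'pred' is a binary recommendation (e.g., refer for diagnostic work-up).
df = pd.DataFrame({"sex": rng.integers(0, 2, 2000)})
# Deliberately biased predictions: men are flagged more often than women
df["pred"] = rng.binomial(1, np.where(df["sex"] == 1, 0.6, 0.4))

dataset = BinaryLabelDataset(
    df=df, label_names=["pred"], protected_attribute_names=["sex"]
)
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{"sex": 1}],
    unprivileged_groups=[{"sex": 0}],
)

# Statistical parity difference: P(pred=1 | women) - P(pred=1 | men); 0 is fair
print("Statistical parity difference:", metric.statistical_parity_difference())
# Disparate impact: ratio of the same two rates; values far from 1 indicate bias
print("Disparate impact:", metric.disparate_impact())
```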
When the above considerations are not managed adequately, an AI system may make mistakes. This raises the intricate question of (moral) accountability, which becomes increasingly pressing as more clinical applications are put in place. However, the traditional notion of accountability is problematic in the context of an AI system. It is questionable whether a clinician can be held responsible for such a system’s decisions. Furthermore, the system’s complexity can make it infeasible for the clinician, and sometimes even the designer, to understand precisely why certain decisions are made. Therefore, we anticipate that the introduction of AI in clinical medicine will first be limited to decision support systems, with the final clinical decision made by the treating physician.

Clinical implementation

Despite exciting showcases, AI has been criticized for underdelivering tangible clinical impact. Translating solid AI models into effective action remains an open challenge, and actual clinical use is still nascent. Recently, even with the surge of COVID-19-related AI research, the clinical value of AI applications remained limited. Important challenges for clinical implementation include questionable clinical advantages, inadequate reporting, and poor adoption of and integration into clinical practice.
Developers of algorithms are also urged to be transparent and complete in their reporting, to provide a fair view of whether their applications improve patient care. The RISE criteria (Regulatory aspects, Interpretability, Interoperability, Structured Data, and Evidence) can help overcome major pitfalls in developing AI applications for clinical practice. Recently, the DECIDE-AI guideline has been introduced as a reporting checklist for the early-stage clinical evaluation of AI-based decision support systems. In addition, clinicians and patients must adapt to working with and trusting new AI systems, and such behavioral change is notoriously hard. There is a need for (better) AI education for clinicians, who will need to adapt to new roles and to the tools that support their decision-making. To smooth this transition, integration of AI into the medical curriculum has been proposed. A recent American Academy of Allergy, Asthma and Immunology workgroup has underscored a knowledge and educational gap in the allergy and immunology field. Furthermore, interoperability of AI systems is vital to ensure that they can be integrated with existing clinical and technical workflows across sites and health systems.