Article
Linking clinical outcomes to virus strains is critical to understanding COVID-19. However, current data collection practices hamper such analyses and need updating if the collected data are to yield robust insights.
GISAID, originally established as the Global Initiative on Sharing All Influenza Data (Elbe & Buckland-Merrett, 2017), has widened its remit with the EpiCoV database to become the principal platform for sharing genomic sequences of SARS-CoV-2 (hCoV-19) from around the world. Such convergence by the global scientific community around a single database is critical to permit near-real-time analysis of how the virus is evolving. While currently only roughly 1 in 165 confirmed cases (“Worldometers coronavirus,” n.d.) has its virus sequence submitted (i.e. 7,663,708 COVID-19 cases and 46,251 published SARS-CoV-2 sequences as of 13 June 2020), this nonetheless represents the most thorough surveillance of an emerging virus outbreak in history (“Massive coronavirus sequencing efforts urgently need patient data - Nature India,” n.d.).
It is therefore critical to supplement the collected information on the virus’s genome with the other component that informs patient outcome: medical information. Such de-identified patient data would provide the missing information that enables the virus’s evolution to be linked to its host’s clinical factors. For example, several studies have suggested the emergence of virus isolates associated with greater in vitro titres and cytopathic effects (Yao et al., 2020), greater transmissibility (Korber et al., 2020), higher fatality (Becerra-Flores & Cardozo, 2020), and aggressive (Banerjee, Dhar, Bhattacharjee, & Bhattacharjee, 2020), attenuated (Su et al., 2020) or similar (Zhang et al., 2020) phenotypes with consequent outcomes.
These observed variations, especially in disease severity and outcomes, may be attributable to genomic evolution and adaptation to the new human host. However, current analyses are confounded by factors such as co-morbidities, the capacity of the health care system in terms of diagnostic testing, treatment choices, and the reporting of severity and fatality, making it impossible to robustly link patient outcome to genomic changes in the virus. This limits studies to being merely observational, either reporting genomic differences of the virus (Bauer et al., 2020) or inferring pathogenicity from cell culture measurements such as replication rate (Yao et al., 2020) and cell toxicity (Chu et al., 2020). While such in silico and in vitro studies are insightful, they are not a reliable predictor of disease severity in vivo.
Recognizing the need for clinical data, GISAID enables “patient status” to be recorded for each submitted isolate, but typically only about 3% of submissions provide relevant information. For instance, as of 15 May 2020, roughly 10% (506/5122) of submitted isolates had this field filled in, and of these only about 32% (164) provided clinical information (Figure 1). This highlights two areas where current processes hamper sustainable and meaningful data collection. Firstly, information is not currently captured in a standardized form tailored to COVID-19 infections; secondly, patient information is frequently not available when genomic information is submitted, and workflows are not set up to amend entries retrospectively.
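The proportions can be recomputed from the raw counts quoted above. The following is a minimal sketch of that arithmetic; the variable and function names are our own labels and are not part of any GISAID interface.

```python
# Counts quoted in the text (GISAID snapshot of 15 May 2020).
total_isolates = 5122      # submitted isolates considered
field_filled = 506         # isolates with the "patient status" field filled in
clinically_useful = 164    # of those, entries that carry clinical information


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


filled_pct = pct(field_filled, total_isolates)              # field filled in
useful_of_filled_pct = pct(clinically_useful, field_filled)  # useful among filled
useful_overall_pct = pct(clinically_useful, total_isolates)  # useful among all

print(filled_pct, useful_of_filled_pct, useful_overall_pct)
```

The last value is the share of all submissions that carry usable clinical information, which is the quantity that matters for genotype-phenotype analyses.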

1. Capturing clinical data in standardised forms

Data that is collected and submitted to a central repository such as GISAID is likely to come from multiple sources with a wide range of digital-readiness levels. For example, it might be extracted from Electronic Medical Records (EMRs), where the data is already in a structured form. However, relevant information may first need to be extracted from digital or paper-based clinical notes. In the latter case, the same clinical symptom might be described differently, complicating downstream reporting or grouping of records. Hence it is important to convert clinical observations into standardized terms, so-called clinical terminologies, that are applicable across the world (Figure 2).
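To illustrate why differently worded notes complicate grouping, and how a standard term resolves this, the sketch below maps free-text variants onto a single concept before counting. The lookup table is hypothetical and hand-written; the SNOMED CT codes are given for illustration only and should be verified against a terminology server before any real use.

```python
from collections import Counter
from typing import Optional, Tuple

# Hypothetical lookup from free-text variants to a (code, display) pair.
# The SNOMED CT codes are illustrative and must be checked against a
# terminology server before use.
SYNONYM_TO_CONCEPT = {
    "fever": ("386661006", "Fever"),
    "pyrexia": ("386661006", "Fever"),
    "high temperature": ("386661006", "Fever"),
    "cough": ("49727002", "Cough"),
    "shortness of breath": ("267036007", "Dyspnea"),
    "dyspnoea": ("267036007", "Dyspnea"),
    "breathlessness": ("267036007", "Dyspnea"),
}


def normalise(term: str) -> Optional[Tuple[str, str]]:
    """Map a free-text symptom to a standard concept, or None if unknown."""
    key = " ".join(term.lower().split())  # lower-case and collapse whitespace
    return SYNONYM_TO_CONCEPT.get(key)


def count_by_concept(terms):
    """Count records per concept; unmapped terms are counted under None."""
    return Counter(normalise(t) for t in terms)


notes = ["Pyrexia", "fever", "High  temperature", "dyspnoea", "runny nose"]
counts = count_by_concept(notes)
print(counts[("386661006", "Fever")])  # three differently worded notes, one concept
print(counts[None])                    # notes that still need manual coding
```

In practice the mapping would come from a terminology service rather than a hand-written table; the point of the sketch is only that grouping becomes trivial once every record carries the same code.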
While the progression towards EMRs is a much larger, multilayered problem that cannot be addressed quickly, particularly amid a pandemic, the mode of primary data collection into the central repository can be controlled by introducing standardised fields that implement standardised terminologies. This would ensure that researchers have a computable set of data on which to build robust statistical methodologies and artificial-intelligence-based analyses, gaining insights from genomic and clinical data.
However, there are several clinical terminologies to choose from, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) and the International Classification of Diseases (ICD). SNOMED CT is the most comprehensive multilingual health terminology in the world, while ICD is a classification specializing in disease description. The main difference between them is that SNOMED CT is much more detailed and can capture fine-grained clinical information, whereas ICD is primarily a classification designed for reporting.
In addition to clinical terminologies, a standard that defines which clinical data should be collected is also needed. For example, in this case it is useful to capture symptoms, risk factors and complications, among others. This is usually referred to as the information model. The new HL7 standard called Fast Healthcare Interoperability Resources (FHIR) stands out as the best choice, given its substantial uptake and excellent support for clinical terminologies.

1.1 Emerging standardization for COVID-19

There are multiple efforts that currently aim to define the minimal COVID-19-relevant clinical data.
The World Health Organization (WHO) has developed a case-based reporting form and data dictionary, as well as interim guidance to clinicians regarding case definitions and clinical syndromes associated with COVID-19 (Table 1). Although the WHO’s forms are more likely to be accepted by clinical teams around the world, they do not capture clinical symptoms and outcomes in detail; for example, there is only a field indicating whether the patient showed symptoms, but not which symptoms. Similarly, clinical course and outcomes are captured in little detail.
Aiming to capture more detail and interpret its clinical impact, the Australian National COVID-19 Clinical Evidence Taskforce (“Australian National COVID-19 Clinical Evidence Taskforce,” n.d.) has compiled a severity score that groups patients into four categories (Figure 3).
However, achieving international agreement on the exact thresholds for the grouping is likely to be difficult, especially as new evidence about the severity of individual symptoms becomes available (Menni et al., 2020). It might hence be more prudent to capture symptoms directly, the approach taken by the COVID-19 host genetics initiative (The COVID-19 Host Genetics Initiative, 2020), which aims to annotate existing human genomic information in large BioBanks by collecting self-reported COVID-19 status from its participants. This consortium has put together a questionnaire aimed at capturing COVID-19 symptoms and co-morbidities, which may provide a way to capture the disease status directly from the patient.
Worldwide standards for classifications and terminologies have been updating their content to include concepts and terms that describe or classify COVID-19-related diseases and symptoms. A clinical diagnostic dictionary collecting these terms from both ICD-10 and SNOMED CT was put together for the COVID-19 host genetics initiative (see Table 1).
This highlights the different approaches the two vocabularies have taken. ICD-10 opted for a high-level “COVID-19” term to enable counting of the number of COVID-19 cases, while SNOMED International is adding several COVID-19-related diagnosis codes to SNOMED CT, providing the ability to capture more specific data about the impact of the disease. Note that SNOMED CT still allows these more specific cases to be grouped and counted.
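The grouping and counting of fine-grained codes relies on the terminology's "is-a" hierarchy: a record coded with a specific concept also counts towards each of its ancestors. The sketch below shows this roll-up on a toy hierarchy. The concept names are made up for illustration and are not SNOMED CT content; a real system would typically ask a terminology server (for example through the FHIR `$subsumes` operation) rather than hold the hierarchy locally.

```python
# Toy "is-a" hierarchy with made-up concept names (not real SNOMED CT content).
# Each concept maps to the list of its direct parents.
PARENTS = {
    "covid19-pneumonia": ["covid19-disease"],
    "covid19-ards": ["covid19-disease"],
    "covid19-disease": ["viral-disease"],
    "influenza": ["viral-disease"],
    "viral-disease": [],
}


def ancestors_or_self(concept):
    """Return the concept together with all of its ancestors."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def count_subsumed(coded_records, ancestor):
    """Count records whose code is the ancestor itself or one of its descendants."""
    return sum(1 for code in coded_records if ancestor in ancestors_or_self(code))


coded_records = ["covid19-pneumonia", "covid19-ards", "covid19-disease", "influenza"]
covid_cases = count_subsumed(coded_records, "covid19-disease")
viral_cases = count_subsumed(coded_records, "viral-disease")
print(covid_cases, viral_cases)
```

The detailed codes are therefore not lost when a high-level case count is needed; the count is simply taken at a higher node of the hierarchy.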
There are also initiatives to develop data models for sharing COVID-19 clinical data using the Fast Healthcare Interoperability Resources (FHIR) standard from HL7 International. One such example is from Logica Health, a consortium of healthcare providers and technical companies in the USA. The FHIR Implementation Guide provided by Logica Health is a guide for capturing information to help with the treatment of patients in hospital.

1.2 What could interoperability look like for COVID-19

Using existing technology and incorporating the guidelines for COVID-19 symptoms and severity discussed above, we built an example FHIR Implementation Guide (FHIR IG) and implemented it as a FHIR questionnaire (see Table 1). This allows the relevant terms for a specific use case to be collected flexibly and expressed as an input form for data collection, e.g. into GISAID. Unlike the FHIR IG from Logica, which focuses on patient care, patient screening, public health reporting, and general research, we designed the questionnaire (fields and values) for the specific use case of linking genomic data with clinical outcomes.
The FHIR IG captures the following types of information:
The FHIR IG also provides a set of standard terms from the SNOMED CT clinical terminology in the form of Value Sets. These are available in the documentation as well as programmatically from a clinical terminology service. The FHIR IG also provides user interface advice – with an example of an implementation for the form used to collect the information shown in Figure 4.
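As an illustration of programmatic access, a value set can be requested from a FHIR terminology server through the standard ValueSet `$expand` operation. The sketch below only builds the request URL and parses an expansion, so that it runs offline. The endpoint and value set URI are taken from the Shrimp link in the References; the function names and the sample response are our own assumptions and do not reproduce actual server output.

```python
from urllib.parse import urlencode

# Endpoint and value set URI as they appear in the Shrimp link in the References.
FHIR_BASE = "https://r4.ontoserver.csiro.au/fhir"
SYMPTOMS_VALUE_SET = (
    "http://genomics.ontoserver.csiro.au/fhir/covid19/"
    "ValueSet/Covid19SymptomsValueSet"
)


def expand_url(base, value_set_url, text_filter=None, count=20):
    """Build a GET URL for the FHIR ValueSet $expand operation."""
    params = {"url": value_set_url, "count": count}
    if text_filter:
        params["filter"] = text_filter  # server-side text match on the members
    return f"{base}/ValueSet/$expand?{urlencode(params)}"


def codings(value_set):
    """Extract (system, code, display) triples from an expanded ValueSet."""
    contains = value_set.get("expansion", {}).get("contains", [])
    return [(c.get("system"), c.get("code"), c.get("display")) for c in contains]


# Hand-written stand-in for a server response (illustrative content only).
sample_response = {
    "resourceType": "ValueSet",
    "expansion": {
        "contains": [
            {"system": "http://snomed.info/sct", "code": "49727002", "display": "Cough"},
        ]
    },
}

url = expand_url(FHIR_BASE, SYMPTOMS_VALUE_SET, text_filter="cou")
print(url)
print(codings(sample_response))
```

Fetching the URL with any HTTP client and passing the decoded JSON to `codings` would complete the round trip; that step is left out here to keep the example independent of network access.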
The FHIR IG provides the guidance needed to build different approaches to data collection. For example, one approach might be to use data extracted from an Electronic Medical Record (EMR) system or a research Electronic Data Capture (EDC) system like REDCap (Harris et al., 2019) for sharing with an organisation such as GISAID. There are existing tools that can be used to facilitate this transformation (Metke-Jimenez & Hansen, 2019). Alternatively, a specific cloud-based web form can be built to capture data and store it in a cloud-based FHIR repository for later analyses.
The value sets developed for the different fields in the clinical entry form can be browsed using a terminology browser. Figure 5 shows the symptoms value set in the CSIRO Shrimp browser, a front end for CSIRO’s terminology server Ontoserver (Metke-Jimenez, Steel, Hansen, & Lawley, 2018).

2. Clinical workflows need to revisit entries

While GISAID enables updates to submitted entries as more patient data becomes available, updating a submitted entry with clinical information is currently not widespread practice. This is in part due to privacy restrictions that have prevented the sharing of patient information (Dyer, 2020). While the current content of GISAID was carefully designed to preserve privacy, adding linkages to clinical databases may require a restructure even with de-identification protocols in place (Bauer et al., 2020; “Massive coronavirus sequencing efforts urgently need patient data - Nature India,” n.d.). For example, in regions with low prevalence, the exact location in combination with height and weight can be identifying. For such a future addition, a clinical record guardian may be needed to provide access to clinical data via a tiered system.
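One common way to reason about this re-identification risk is a k-anonymity style check: records are grouped by their combination of quasi-identifiers, and any combination shared by fewer than k records is flagged. The sketch below is a minimal illustration under our own assumptions; the field names, banding widths and records are invented and do not reflect GISAID's schema or any protocol described here.

```python
from collections import Counter

# Hypothetical de-identified records; field names and values are invented.
records = [
    {"isolate": "A", "region": "Region-1", "height_cm": 180, "weight_kg": 82},
    {"isolate": "B", "region": "Region-1", "height_cm": 181, "weight_kg": 84},
    {"isolate": "C", "region": "Region-2", "height_cm": 165, "weight_kg": 60},
]


def band(value, width):
    """Coarsen a number into the lower bound of its band, e.g. 181 -> 180."""
    return (value // width) * width


def quasi_key(record):
    """Combination of attributes that could jointly identify a person."""
    return (
        record["region"],
        band(record["height_cm"], 10),
        band(record["weight_kg"], 10),
    )


def risky_isolates(rows, k=2):
    """Return isolates whose quasi-identifier combination occurs fewer than k times."""
    sizes = Counter(quasi_key(r) for r in rows)
    return [r["isolate"] for r in rows if sizes[quasi_key(r)] < k]


flagged = risky_isolates(records, k=2)
print(flagged)
```

Here the single record from the low-prevalence region is flagged even after height and weight have been coarsened, which is the situation in which access through a guardian or tiered system becomes relevant.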
Other likely factors are the time-consuming aspect of a task that does not immediately save lives, compounded by the reference laboratories having to chase up busy clinical teams who may not see the immediate benefit. While compiling patient information will remain a labour-intensive task, at least the design of the input forms can help by not increasing the data-entry burden unduly.
Walking the tightrope between capturing enough data in a standardized way and not making entry so onerous that it deters individuals from submitting information in the first place is an ongoing challenge. For our case-study FHIR IG, we have chosen to make most of the data fields simple check boxes, with the possibility of selecting more granular concepts using auto-complete style search powered by the terminology server. This expands on the recommendations from the WHO’s guidance, while still ensuring quick and efficient data capture with consistency across the world.
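To make this design choice concrete, the sketch below builds a small FHIR Questionnaire as a Python dictionary with one check-box item bound to the symptoms value set and one auto-complete item for more granular concepts. It is an illustrative fragment under our own assumptions, not the published FHIR IG: the `linkId` values, item texts and the use of the implicit all-of-SNOMED-CT value set for the search field are hypothetical choices.

```python
import json

# Standard FHIR extension used to hint at the rendering of an item.
ITEM_CONTROL_URL = "http://hl7.org/fhir/StructureDefinition/questionnaire-itemControl"
ITEM_CONTROL_SYSTEM = "http://hl7.org/fhir/questionnaire-item-control"

# Value set URI as it appears in the Shrimp link in the References.
SYMPTOMS_VALUE_SET = (
    "http://genomics.ontoserver.csiro.au/fhir/covid19/"
    "ValueSet/Covid19SymptomsValueSet"
)
# Assumption: implicit value set covering all of SNOMED CT for free search.
ALL_SNOMED_VALUE_SET = "http://snomed.info/sct?fhir_vs"


def item_control(code):
    """Build the item-control extension, e.g. 'check-box' or 'autocomplete'."""
    return {
        "url": ITEM_CONTROL_URL,
        "valueCodeableConcept": {
            "coding": [{"system": ITEM_CONTROL_SYSTEM, "code": code}]
        },
    }


questionnaire = {
    "resourceType": "Questionnaire",
    "status": "draft",
    "item": [
        {
            "linkId": "symptoms",  # hypothetical identifier
            "text": "Symptoms",
            "type": "choice",
            "repeats": True,  # several boxes may be ticked
            "answerValueSet": SYMPTOMS_VALUE_SET,
            "extension": [item_control("check-box")],
        },
        {
            "linkId": "symptoms-other",  # hypothetical identifier
            "text": "Other symptoms (search)",
            "type": "open-choice",
            "repeats": True,
            "answerValueSet": ALL_SNOMED_VALUE_SET,
            "extension": [item_control("autocomplete")],
        },
    ],
}

print(json.dumps(questionnaire, indent=2))
```

Keeping the short list as check boxes means the common case needs no typing, while the second item leaves room for detail without adding fields to the form.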
Implementing the COVID-19 symptom capture as check boxes is possible because most guidelines provide a limited list of symptoms to capture. Should this list be expanded in the future, or extended to other viruses such as influenza virus and Respiratory Syncytial Virus, an “auto-complete” search or a drop-down list can easily be added to the FHIR IG.
However, it must be stressed that manual data re-entry, even with the use of a FHIR questionnaire, can only be an intermediate solution, as efficiency and accuracy can only be achieved by enabling interoperability with clinical systems and data pre-population through FHIR standards like Structured Data Capture. For example, while investigating the D614G mutation (Korber et al., 2020), it was discovered that the VIC31 and VIC50 isolates originate from the same patient, and it is likely that more such duplicates exist and complicate data analysis. Similarly, the patient’s home state might differ from that of the submitting laboratory, potentially confusing epidemiological analyses, as was shown to be the case in India (Mehrotra, 2020).
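If submissions carried a stable, de-identified patient key, duplicates of this kind could be detected mechanically rather than by chance. The sketch below is a minimal illustration of that idea. The isolate names VIC31 and VIC50 come from the text; the `patient_pseudonym` field, its values and the third record are invented for illustration and are not part of the current GISAID submission format.

```python
from collections import defaultdict

# VIC31 and VIC50 are named in the text; the pseudonyms and the third
# record are invented for illustration.
submissions = [
    {"isolate": "VIC31", "patient_pseudonym": "p-001"},
    {"isolate": "VIC50", "patient_pseudonym": "p-001"},
    {"isolate": "EXAMPLE-1", "patient_pseudonym": "p-002"},
]


def isolates_by_patient(rows):
    """Group isolate names by the de-identified patient key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["patient_pseudonym"]].append(row["isolate"])
    return dict(groups)


def repeated_patients(rows):
    """Return only those patients who contributed more than one isolate."""
    return {
        patient: names
        for patient, names in isolates_by_patient(rows).items()
        if len(names) > 1
    }


duplicates = repeated_patients(submissions)
print(duplicates)
```

Analyses that assume one isolate per patient could then either keep a single representative per key or model the repeated sampling explicitly.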

Recommendations

In order to assess and detect a shift in the clinical presentation of COVID-19, de-identified patient data needs to be collected in a more systematic way. We hence recommend three elements for the medical and scientific community to consider for capturing COVID-19 better:
  1. Define the common information model and standard code sets to describe patient “journeys” in coordination with WHO.
  2. Work towards full interoperability, where the EMRs can pre-populate the FHIR questionnaire. The first step of creating a standard questionnaire with a FHIR IG (Metke-Jimenez & Hansen, 2019) already represents a substantial advancement.
  3. Update clinical workflows to revisit entries and update information.
Anticipating the opportunity for retrospective data intake in a more controlled fashion, GISAID has a mechanism to reach out to data submitters to update entries. As a more immediate improvement, GISAID now provides a filter for serving out cleaned data, correcting and consolidating 26,838 entries (see consolidated entries as of 15th May 2020 in Supplemental File 1), which is aided by a data curation tool. All data ingested from 27 April 2020 onwards will capture patient data with entry support to ensure consistency.
These measures are valuable because the pandemic could well continue or re-emerge for some time, creating the potential for new virus strains to be linked to decreased or increased case severity and/or fatality, and potentially to affect the efficacy of vaccines and countermeasures. GISAID offers clade/lineage and variant information to facilitate genotype-phenotype analyses. Gaining experience in controlled data collection increases our preparedness for future ‘Disease X’ outbreaks or pandemics, and enables better support of research on other infectious diseases such as influenza and Respiratory Syncytial Virus.
Acknowledgments
ST was supported by a grant awarded to Timothy Barkham and Swaine Chen by the Temasek Foundation and by the Genome Institute of Singapore. ST and SMS are supported by the Agency for Science, Technology and Research (A*STAR). AP’s work on the automated meta-data curation tool is supported by Institut Pasteur, with feedback from its EpiCoV data curation team aiding GISAID. CSIRO is supported by a grant awarded to SSV by the Coalition for Epidemic Preparedness Innovations (CEPI).
Competing Interests
The authors declare that there are no competing interests.
Author Contribution
DCB, SSV and DPH conceived the paper. ST and AP structured the data. AM, LOWW and JY conducted the analysis. DCB, SM, KE, DPH and SSV wrote the paper. All authors finalized the document.
Data Availability
Not applicable
Ethical Statement
Not applicable

References

Australian National COVID-19 Clinical Evidence Taskforce. (n.d.). Retrieved May 12, 2020, from https://covid19evidence.net.au/
Banerjee, S., Dhar, S., Bhattacharjee, S., & Bhattacharjee, P. (2020). Decoding the lethal effect of SARS-CoV-2 (novel coronavirus) strains from global perspective: molecular pathogenesis and evolutionary divergence. BioRxiv. doi:10.1101/2020.04.06.027854
Bauer, D. C., Tay, A. P., Wilson, L. O. W., Reti, D., Hosking, C., McAuley, A. J., … Vasan, S. S. (2020). Supporting pandemic response using genomics and bioinformatics: a case study on the emergent SARS-CoV-2 outbreak. Transboundary and Emerging Diseases.
Becerra-Flores, M., & Cardozo, T. (2020). SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate. International Journal of Clinical Practice. doi:10.1111/ijcp.13525
Chu, H., Chan, J. F.-W., Yuen, T. T.-T., Shuai, H., Yuan, S., Wang, Y., … Yuen, K.-Y. (2020). Comparative tropism, replication kinetics, and cell damage profiling of SARS-CoV-2 and SARS-CoV with implications for clinical manifestations, transmissibility, and laboratory studies of COVID-19: an observational study. The Lancet Microbe. doi:10.1016/S2666-5247(20)30004-5
Dyer, C. (2020). Covid-19: Rules on sharing confidential patient information are relaxed in England. BMJ (Clinical Research Ed.), 369, m1378. doi:10.1136/bmj.m1378
Elbe, S., & Buckland-Merrett, G. (2017). Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1(1), 33–46. doi:10.1002/gch2.1018
Harris, P. A., Taylor, R., Minor, B. L., Elliott, V., Fernandez, M., O’Neal, L., … REDCap Consortium. (2019). The REDCap consortium: Building an international community of software platform partners. Journal of Biomedical Informatics, 95, 103208. doi:10.1016/j.jbi.2019.103208
Korber, B., Fischer, W., Gnanakaran, S. G., Yoon, H., Theiler, J., Abfalterer, W., … Sheffield COVID-19 Genomics Group. (2020). Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. BioRxiv. doi:10.1101/2020.04.29.069054
Massive coronavirus sequencing efforts urgently need patient data - Nature India. (n.d.). Retrieved May 27, 2020, from https://www.natureasia.com/en/nindia/article/10.1038/nindia.2020.75
Mehrotra, K. (2020, May 27). ‘Unassigned’ coronavirus cases near 3,000, rise as curbs on movement lifted. The Indian Express.
Menni, C., Valdes, A. M., Freidin, M. B., Sudre, C. H., Nguyen, L. H., Drew, D. A., … Spector, T. D. (2020). Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature Medicine. doi:10.1038/s41591-020-0916-2
Metke-Jimenez, A., & Hansen, D. (2019). FHIRCap: Transforming REDCap forms into FHIR resources. AMIA Joint Summits on Translational Science Proceedings AMIA Summit on Translational Science, 2019, 54–63.
Metke-Jimenez, A., Steel, J., Hansen, D., & Lawley, M. (2018). Ontoserver: a syndicated terminology server. Journal of Biomedical Semantics, 9(1), 24. doi:10.1186/s13326-018-0191-z
Shrimp browser citable link for COVID-19 symptoms. (n.d.). Retrieved May 12, 2020, from https://ontoserver.csiro.au/shrimp/vs.html?system=undefined&valueSetUri=http%3A%2F%2Fgenomics.ontoserver.csiro.au%2Ffhir%2Fcovid19%2FValueSet%2FCovid19SymptomsValueSet&valueSetId=Covid19SymptomsValueSet&fhir=https://r4.ontoserver.csiro.au/fhir
Su, Y., Anderson, D., Young, B., Zhu, F., Linster, M., Kalimuddin, S., … Smith, G. (2020). Discovery of a 382-nt deletion during the early evolution of SARS-CoV-2. BioRxiv. doi:10.1101/2020.03.11.987222
The COVID-19 Host Genetics Initiative. (2020). The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic. European Journal of Human Genetics. doi:10.1038/s41431-020-0636-6
Worldometers coronavirus. (n.d.). Retrieved April 28, 2020, from https://www.worldometers.info/coronavirus/
Yao, H., Lu, X., Chen, Q., Xu, K., Chen, Y., Cheng, L., … Li, L. (2020). Patient-derived mutations impact pathogenicity of SARS-CoV-2. MedRxiv. doi:10.1101/2020.04.14.20060160
Zhang, X., Tan, Y., Ling, Y., Lu, G., Liu, F., Yi, Z., … Lu, H. (2020). Viral and host factors related to the clinical outcome of COVID-19. Nature. doi:10.1038/s41586-020-2355-0