Article
Being able to link clinical outcomes to virus strains is a critical
component of understanding COVID-19; however, current data collection
practices hamper such analyses and require updating to support robust
insights from the data collected.
GISAID, originally established as the Global Initiative on Sharing All
Influenza Data (Elbe & Buckland-Merrett, 2017), has widened its remit
with the EpiCoV™ database to become the principal
platform for the sharing of genomic sequences of SARS-CoV-2 (hCoV-19)
from around the world. Such convergence by the global scientific
community around a single database is critical to permit a
near-real-time analysis of how the virus is evolving. While currently
only 1 out of 165 confirmed cases (“Worldometers coronavirus,” n.d.)
sees the virus sequence submitted (i.e. 7,663,708 COVID-19 cases and
46,251 published SARS-CoV-2 sequences as of 13 June 2020), it
nonetheless represents the most thorough surveillance of an emerging
virus outbreak in history (“Massive coronavirus sequencing efforts
urgently need patient data - Nature India,” n.d.).
It is therefore critical to supplement the collected information on the
virus’s genome with the other critical component informing patient
outcome: medical information. Such de-identified patient data would
provide the missing information that enables the virus’s evolution to be
linked to its host’s clinical factors. For example, several studies have
suggested the emergence of virus isolates associated with greater
in vitro titres and cytopathic effects (Yao et al., 2020), greater
transmissibility (Korber et al., 2020), higher fatality (Becerra-Flores &
Cardozo, 2020), aggressive (Banerjee, Dhar, Bhattacharjee, &
Bhattacharjee, 2020), attenuated (Su et al., 2020) or similar (Zhang et
al., 2020) phenotypes with consequent outcomes.
These observed variations, especially disease severity and outcomes, may
be attributable to genomic evolution and adaptation to the new human
host. However, current analyses are confounded by factors such as
co-morbidities, capacity of the health care system in terms of
diagnostic testing, treatment choices, and reporting of severity and
fatality – making it impossible to robustly link patient outcome to
genomic changes in the virus. This limits studies to being merely
observational, either reporting genomic differences of the virus (Bauer et
al., 2020) or inferring pathogenicity from cell culture measurements
such as replication rate (Yao et al., 2020) and cell toxicity (Chu et al.,
2020). While such in silico and in vitro studies are
insightful, they are not a reliable predictor of disease severity in vivo.
Recognizing the need for clinical data, GISAID enables “patient
status” to be recorded for each submitted isolate, but typically only
about 3% of entries carry relevant information. For instance, about 10%
(506/5122) of submitted isolates have this field filled in, and of these
only about a third (164) provide clinical information as of 15 May 2020
(Figure 1).
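The proportions quoted above can be re-derived from the counts given in the text. The following is a minimal check; the variable names are illustrative only.

```python
# Counts quoted in the text (GISAID entries as of 15 May 2020).
total_isolates = 5122   # submitted isolates considered
status_filled = 506     # isolates with the "patient status" field filled in
informative = 164       # of those, entries that carry clinical information

share_filled = status_filled / total_isolates
share_informative_of_filled = informative / status_filled
share_informative_overall = informative / total_isolates

print(f"field filled in:        {share_filled:.1%}")
print(f"informative, of filled: {share_informative_of_filled:.1%}")
print(f"informative, of all:    {share_informative_overall:.1%}")
```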
This highlights two areas where current processes hamper sustainable and
meaningful data collection. Firstly, information is currently not
captured in a standardized form that is tailored to COVID-19 infections;
secondly, patient information is frequently not available when genomic
information is submitted, and workflows are not set up to amend entries
retrospectively.
1. Capturing clinical data in standardised forms
Data that is collected and submitted to a central repository such as
GISAID likely comes from multiple sources, with consequently a wide
range of digital-readiness levels. For example, it might be extracted
from Electronic Medical Records (EMRs) where the data is already in a
structured form. However, relevant information may first need to be
extracted from digital or paper-based clinical notes. In
the latter case, the same clinical symptom might be described
differently, complicating downstream reporting or grouping of records.
Hence, converting clinical observations into standardized terms,
so-called clinical terminologies that are applicable across the world, is
important (Figure 2).
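As a minimal sketch of what such a conversion involves, the snippet below maps differently worded free-text symptoms onto a single coded concept. The synonym table and the function name are assumptions made for illustration; the SNOMED CT codes are shown as examples and should be verified against a terminology server before use.

```python
# Illustrative synonym table: free-text phrase -> (SNOMED CT code, preferred term).
# Codes are examples only and should be checked against a terminology server.
SYMPTOM_CODES = {
    "fever": ("386661006", "Fever"),
    "pyrexia": ("386661006", "Fever"),
    "high temperature": ("386661006", "Fever"),
    "cough": ("49727002", "Cough"),
    "shortness of breath": ("267036007", "Dyspnea"),
    "dyspnoea": ("267036007", "Dyspnea"),
    "breathlessness": ("267036007", "Dyspnea"),
}

def normalise_symptom(text):
    """Return (code, preferred term) for a free-text symptom, or None if unmapped."""
    key = " ".join(text.lower().split())  # lower-case and collapse whitespace
    return SYMPTOM_CODES.get(key)

notes = ["Pyrexia", "high  temperature", "Breathlessness", "sore throat"]
for note in notes:
    print(note, "->", normalise_symptom(note))
```

Unmapped phrases (here "sore throat") return None, so they can be routed to manual review rather than being silently dropped.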
While the progression towards EMRs is a much larger, multilayered problem
that cannot be addressed quickly, even (or especially) amid a pandemic, the
mode of primary data collection into the central repository can be
controlled by introducing standardised fields implementing standardised
terminologies. This would ensure that researchers have a computable set
of data on which to build robust statistical methodologies and artificial
intelligence-based analyses, gaining insights from genomic and clinical
data.
However, there are several clinical terminologies, such as Systematized
Nomenclature of Medicine (SNOMED CT) and International Classification of
Diseases (ICD). SNOMED CT is the most comprehensive multilingual health
terminology in the world, while ICD is a classification specializing in
disease description. The main difference between them is that SNOMED CT
is much more detailed and can be used to capture fine-grained clinical
information while ICD is primarily a classification designed for
reporting.
In addition to clinical terminologies, a standard that defines which
clinical data should be collected is also needed. For example, in this
case it is useful to capture symptoms, risk factors and complications,
among others. This is usually referred to as the information
model. The new HL7 standard, Fast Healthcare Interoperability
Resources (FHIR), stands out as the best choice, given its substantial
uptake and excellent support for clinical terminologies.
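To illustrate how an information model and a terminology fit together, the sketch below builds one possible FHIR representation of a single symptom as an Observation resource whose code comes from SNOMED CT. The helper name and the patient reference are hypothetical, and a real implementation guide may model symptoms differently (for example as Condition resources).

```python
import json

SNOMED_SYSTEM = "http://snomed.info/sct"  # FHIR system URI for SNOMED CT

def symptom_observation(patient_ref, code, display, present):
    """Build a minimal FHIR Observation stating whether a symptom was present."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": SNOMED_SYSTEM, "code": code, "display": display}]},
        "subject": {"reference": patient_ref},  # hypothetical de-identified patient
        "valueBoolean": present,
    }

obs = symptom_observation("Patient/example-deidentified-001", "49727002", "Cough", True)
print(json.dumps(obs, indent=2))
```

The information model fixes which fields exist (code, subject, value), while the terminology fixes which values the coded field may take.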
1.1 Emerging standardization for COVID-19
There are multiple efforts that currently aim to define the minimal
COVID-19-relevant clinical data.
The World Health Organization (WHO) has developed a case-based reporting
form and data dictionary, as well as interim guidance to clinicians
regarding case definitions and clinical syndromes associated with
COVID-19 (Table 1). Although the WHO’s forms are more likely to be
accepted by clinical teams around the world, they do not
capture clinical symptoms and outcomes in detail; for example, there is
only a field indicating whether the patient showed symptoms, not which
symptoms. Similarly, clinical course and outcomes are captured in little detail.
Aiming to capture more details and interpret their clinical impact, the
Australian National COVID-19 Clinical Evidence Taskforce (“Australian
National COVID-19 Clinical Evidence Taskforce,” n.d.) has compiled a
severity score that groups patients into four categories (Figure 3).
However, achieving international agreement on the exact thresholds for
the grouping is likely to be difficult, especially as new evidence about the
severity of individual symptoms becomes available (Menni et al., 2020).
It might hence be more prudent to capture symptoms directly, the
approach taken by the COVID-19 Host Genetics Initiative (The COVID-19 Host
Genetics Initiative, 2020), which aims to annotate existing human
genomic information in large BioBanks by collecting self-reported
COVID-19 status from its participants. This consortium has put together
a questionnaire aimed at capturing COVID-19 symptoms and co-morbidities,
which may provide a way to capture the disease status directly from the
patient.
Worldwide standards for classifications and terminologies have been
updating their content to include concepts and terms that describe or
classify COVID-19-related diseases and symptoms. A clinical diagnostic
dictionary collecting these terms from both ICD-10 and SNOMED CT was put
together for the COVID-19 Host Genetics Initiative (see Table 1).
This highlights the different approaches the two vocabularies have
taken. ICD-10 opted for a high-level “COVID-19” term to enable
counting of the number of COVID-19 cases, while SNOMED International is
adding several COVID-19-related diagnosis codes to SNOMED CT, providing
the ability to capture more specific data about the impact of the
disease. Note that SNOMED CT still allows these more specific cases to
be grouped and counted.
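The grouping works because specific concepts are linked to more general ones through is-a relationships, so a case coded at a fine-grained level can also be counted under its parent. A minimal sketch with a toy hierarchy follows; the concept labels are placeholders, not real SNOMED CT codes.

```python
from collections import Counter

# Toy is-a hierarchy (child -> parent). Labels are placeholders, not real codes.
PARENT = {
    "covid19-pneumonia": "covid19",
    "covid19-ards": "covid19",
    "covid19": None,
}

def ancestors_and_self(concept):
    """Yield the concept itself, then each of its ancestors up to the root."""
    while concept is not None:
        yield concept
        concept = PARENT[concept]

def rollup_counts(case_codes):
    """Count each case under its own concept and under every ancestor."""
    counts = Counter()
    for code in case_codes:
        for concept in ancestors_and_self(code):
            counts[concept] += 1
    return counts

cases = ["covid19-pneumonia", "covid19", "covid19-ards", "covid19-pneumonia"]
counts = rollup_counts(cases)
print(counts["covid19"], counts["covid19-pneumonia"], counts["covid19-ards"])
```

In practice a terminology server would answer such subsumption queries; the dictionary here only stands in for that service.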
There are also initiatives to develop data models for
sharing COVID-19 clinical data using the Fast Healthcare Interoperability
Resources (FHIR) standard from HL7 International. One such example is
from Logica Health, a consortium of healthcare providers and technology
companies in the USA. The FHIR Implementation Guide provided by Logica
Health is a guide for capturing information to help with the treatment
of patients in hospital.
1.2 What could interoperability look like for COVID-19
Using existing technology and incorporating the guidelines for COVID-19
symptoms and severity discussed above, we built an example FHIR
Implementation Guide (FHIR IG) and implemented it as a FHIR
questionnaire (see Table 1). This allows relevant terms for a specific
use case to be collected flexibly and expressed
as an input form for data collection, e.g. into GISAID. Unlike the FHIR
IG from Logica Health, which focuses on patient care, patient screening,
public health reporting, and general research, we designed the questionnaire
(fields and values) for the specific use case of linking genomic data
with clinical outcomes.
The FHIR IG captures the following types of information:
- Demographic information – such as the age and gender of the patient
- Pre-existing clinical information – such as co-morbidities and
medication
- Travel history
- Observed COVID-19 symptoms
- Severity of COVID-19 disease
- Outcome
- Immunization history
The FHIR IG also provides a set of standard terms from the SNOMED CT
clinical terminology in the form of Value Sets. These are available in
the documentation as well as programmatically from a clinical
terminology service. The FHIR IG also provides user interface advice –
with an example of an implementation for the form used to collect the
information shown in Figure 4.
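To make the shape of such a questionnaire concrete, the sketch below assembles a FHIR R4 Questionnaire resource covering the seven types of information listed above. It is not the authors' published IG: the linkIds, question texts and value set URLs (under the hypothetical base http://example.org) are assumptions chosen for illustration.

```python
import json

# Hypothetical canonical base for the value sets; the IG's real URLs may differ.
VS_BASE = "http://example.org/fhir/covid19/ValueSet/"

def choice(link_id, text, value_set, repeats=False):
    """A coded question whose permitted answers come from a value set."""
    return {"linkId": link_id, "text": text, "type": "choice",
            "repeats": repeats, "answerValueSet": VS_BASE + value_set}

questionnaire = {
    "resourceType": "Questionnaire",
    "status": "draft",
    "title": "COVID-19 clinical data for genomic linkage (illustrative)",
    "item": [
        {"linkId": "demographics", "text": "Demographic information", "type": "group",
         "item": [
             {"linkId": "demographics.age", "text": "Age (years)", "type": "integer"},
             choice("demographics.gender", "Gender", "Gender"),
         ]},
        {"linkId": "preexisting", "text": "Pre-existing clinical information",
         "type": "group",
         "item": [
             choice("preexisting.comorbidities", "Co-morbidities", "Comorbidities", True),
             choice("preexisting.medication", "Medication", "Medication", True),
         ]},
        {"linkId": "travel", "text": "Travel history", "type": "string"},
        choice("symptoms", "Observed COVID-19 symptoms", "Symptoms", True),
        choice("severity", "Severity of COVID-19 disease", "Severity"),
        choice("outcome", "Outcome", "Outcome"),
        choice("immunization", "Immunization history", "Immunization", True),
    ],
}

def link_ids(items):
    """Walk the item tree and yield every linkId, including nested ones."""
    for item in items:
        yield item["linkId"]
        yield from link_ids(item.get("item", []))

print(list(link_ids(questionnaire["item"])))
print(json.dumps(questionnaire["item"][3], indent=2))
```

Binding each coded question to a value set, rather than listing answers inline, keeps the form definition stable while the permitted terms are maintained on the terminology server.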
The FHIR IG provides the guidance needed to build different approaches
to data collection. For example, one approach might be to use data
extracted from an Electronic Medical Record (EMR) system or a research
Electronic Data Capture (EDC) system like REDCap (Harris et al., 2019)
for sharing with an organisation such as GISAID. There are existing
tools that can be used to facilitate this transformation (Metke-Jimenez
& Hansen, 2019). Alternatively, a specific cloud-based web form can be
built to capture data and store it in a cloud-based FHIR repository for
later analyses.
The value sets developed for the different fields in the clinical entry
form can be browsed using a terminology browser. Figure 5 shows the
symptoms value set in the CSIRO Shrimp browser, a front end for CSIRO’s
terminology server Ontoserver (Metke-Jimenez, Steel, Hansen, & Lawley,
2018).
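Programmatic access to a value set typically goes through the standard FHIR ValueSet $expand operation. The sketch below builds such a request and reads the codes out of a response, under these assumptions: the endpoint and value set URI are those appearing in the Shrimp link in the reference list and may have changed since; the sample response is hand-written so that the example runs offline; the function names are illustrative.

```python
from urllib.parse import urlencode

# Both URLs are taken from the Shrimp link in the reference list.
FHIR_BASE = "https://r4.ontoserver.csiro.au/fhir"
SYMPTOMS_VS = ("http://genomics.ontoserver.csiro.au/fhir/covid19/"
               "ValueSet/Covid19SymptomsValueSet")

def expand_url(base, value_set_uri, text_filter=None, count=None):
    """Build a ValueSet/$expand request URL (a standard FHIR operation)."""
    params = {"url": value_set_uri}
    if text_filter:
        params["filter"] = text_filter  # server-side text match, e.g. for auto-complete
    if count is not None:
        params["count"] = count         # cap the number of returned concepts
    return f"{base}/ValueSet/$expand?{urlencode(params)}"

def codes_from_expansion(value_set):
    """Extract (system, code, display) triples from an expanded ValueSet."""
    contains = value_set.get("expansion", {}).get("contains", [])
    return [(c.get("system"), c.get("code"), c.get("display")) for c in contains]

# Hand-written stand-in for a server response, so the sketch runs offline.
sample_response = {
    "resourceType": "ValueSet",
    "expansion": {"contains": [
        {"system": "http://snomed.info/sct", "code": "49727002", "display": "Cough"},
        {"system": "http://snomed.info/sct", "code": "386661006", "display": "Fever"},
    ]},
}

print(expand_url(FHIR_BASE, SYMPTOMS_VS, text_filter="cou", count=10))
print(codes_from_expansion(sample_response))
```

The same operation with a `filter` parameter is one way a data-entry form could offer auto-complete style search over a larger value set.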
2. Clinical workflows need to revisit entries
While GISAID enables updates to submitted entries as more patient data
becomes available, updating a submitted entry with clinical information
is currently not a widespread practice. This is in part due to privacy
restrictions that have prevented the sharing of patient information (Dyer,
2020). While the current content of GISAID was carefully designed to
preserve privacy, adding linkages to clinical databases may require a
restructure even with de-identification protocols in place (Bauer et
al., 2020; “Massive coronavirus sequencing efforts urgently need
patient data - Nature India,” n.d.). For example, in regions with low
prevalence, the exact location in combination with height and weight can
be identifying. For such a future addition, a clinical record guardian
may be needed to provide access to clinical data via a tiered system.
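One common way to quantify this kind of re-identification risk is a k-anonymity check, which flags records whose combination of quasi-identifiers is shared by fewer than k records. This technique is not named in the text and is used here purely as an illustration; the records and field names are made up and do not reflect GISAID's schema.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)
    group_sizes = Counter(key(r) for r in records)
    return [r for r in records if group_sizes[key(r)] < k]

# Made-up records; field names are hypothetical.
records = [
    {"id": "a", "location": "Town X", "height_cm": 170, "weight_kg": 70},
    {"id": "b", "location": "Town X", "height_cm": 170, "weight_kg": 70},
    {"id": "c", "location": "Town Y", "height_cm": 192, "weight_kg": 110},
]

at_risk = small_groups(records, ["location", "height_cm", "weight_kg"], k=2)
print([r["id"] for r in at_risk])
```

Records flagged this way could be coarsened (for example, reporting a region instead of a town) or released only through a more restricted access tier.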
Other likely factors are the time-consuming aspect of a task that does
not immediately save lives, compounded by the reference laboratories
having to chase up busy clinical teams who may not see the immediate
benefit. While compiling patient information will remain a
labour-intensive task, at least the design of the input forms can help
by not increasing the data-entry burden unduly.
Walking the tightrope between capturing enough data in a standardized
way and not making entry so onerous as to deter individuals from
wanting to submit information in the first place is an ongoing
challenge. For our case-study FHIR IG, we have chosen to make most of
the data fields simple check boxes, with the possibility of
selecting more granular concepts using auto-complete style search
powered by the terminology server. This expands on the recommendations
from the WHO’s guidance, while still ensuring quick and efficient data
capture with consistency across the world.
Implementing the COVID-19 symptom-capture as check boxes is possible
because most guidelines provide a limited list of symptoms to capture.
Should this list be expanded in the future or for other viruses, such as
influenza virus and respiratory syncytial virus, an “auto-complete”
search or drop-down list can easily be added to the FHIR IG.
However, it must be stressed that manual data re-entry, even with the use
of a FHIR questionnaire, can only be an intermediate solution, as
efficiency and accuracy can only be achieved by enabling
interoperability with clinical systems and data pre-population through
FHIR standards like Structured Data Capture. For example, while
investigating the D614G mutation (Korber et al., 2020), it was discovered
that the VIC31 and VIC50 isolates originate from the same patient, and it is
likely that more such duplicates exist and complicate data analysis.
Similarly, the patient’s home state might differ from that of the submitting
laboratory, potentially confusing epidemiological analyses, as was shown
to be the case in India (Mehrotra, 2020).
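Both problems become straightforward to detect once the relevant fields are captured in structured form. The sketch below assumes each record carries a hypothetical de-identified patient key and separate state fields for patient and laboratory; all values are made up for illustration.

```python
from collections import defaultdict

# Made-up records. "patient_key" stands for a hypothetical de-identified
# patient identifier attached to each isolate by the submitting laboratory.
records = [
    {"isolate": "isolate-A", "patient_key": "p1", "patient_state": "S1", "lab_state": "S1"},
    {"isolate": "isolate-B", "patient_key": "p1", "patient_state": "S1", "lab_state": "S1"},
    {"isolate": "isolate-C", "patient_key": "p2", "patient_state": "S2", "lab_state": "S1"},
]

def repeat_sampled_patients(rows):
    """Map each patient key with more than one isolate to its isolates."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["patient_key"]].append(row["isolate"])
    return {key: isolates for key, isolates in groups.items() if len(isolates) > 1}

def location_mismatches(rows):
    """List isolates whose patient state differs from the laboratory's state."""
    return [r["isolate"] for r in rows if r["patient_state"] != r["lab_state"]]

duplicates = repeat_sampled_patients(records)
mismatches = location_mismatches(records)
print(duplicates)
print(mismatches)
```

Without a shared patient key, such duplicates can only be found by chance or by manual investigation.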
Recommendations
In order to assess and detect a
shift in the clinical presentation of COVID-19, de-identified patient
data needs to be collected in a more systematic way. We hence recommend
three elements for the medical and scientific community to consider for
capturing COVID-19 better:
- Define the common information model and standard code sets to describe
patient “journeys” in coordination with WHO.
- Work towards full interoperability where the EMRs can pre-populate the
FHIR questionnaire; creating a standard questionnaire with a
FHIR IG (Metke-Jimenez & Hansen, 2019) is a first step that already
represents a substantial advancement.
- Update clinical workflows to revisit entries and update information.
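As a rough illustration of the second recommendation, the sketch below pre-populates a FHIR QuestionnaireResponse from an EMR-like record. It is a simplification and does not implement the Structured Data Capture extraction mechanism itself: the record layout, the linkIds and the mapping table are all hypothetical, and the response is kept flat for brevity.

```python
SNOMED_SYSTEM = "http://snomed.info/sct"

# Hypothetical EMR extract for one de-identified patient.
emr_record = {"age": 54, "symptom_codes": ["49727002", "386661006"]}

# Maps (hypothetical) questionnaire linkIds to functions that build FHIR answers.
POPULATORS = {
    "demographics.age": lambda r: (
        [{"valueInteger": r["age"]}] if "age" in r else []),
    "symptoms": lambda r: [
        {"valueCoding": {"system": SNOMED_SYSTEM, "code": code}}
        for code in r.get("symptom_codes", [])],
    "outcome": lambda r: (
        [{"valueCoding": {"system": SNOMED_SYSTEM, "code": r["outcome_code"]}}]
        if "outcome_code" in r else []),
}

def prepopulate(record):
    """Build an in-progress QuestionnaireResponse from the fields that are known."""
    items = []
    for link_id, build_answers in POPULATORS.items():
        answers = build_answers(record)
        if answers:  # unknown fields are left for manual entry
            items.append({"linkId": link_id, "answer": answers})
    return {"resourceType": "QuestionnaireResponse",
            "status": "in-progress",
            "item": items}

response = prepopulate(emr_record)
print([item["linkId"] for item in response["item"]])
```

Leaving the status as "in-progress" signals that a clinician or curator still needs to review the entry and fill in what the EMR could not supply, such as the outcome.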
Anticipating the opportunity for retrospective data intake in a more
controlled fashion, GISAID has a mechanism to reach out to data
submitters to update entries. As a more immediate improvement, GISAID
now provides a filter for serving out cleaned data, correcting and
consolidating 26,838 entries (see consolidated entries as of 15 May
2020 in Supplemental File 1), which is aided by a data curation tool.
All data ingested from 27 April 2020 onwards will capture patient data
with entry support ensuring consistency.
These measures are valuable because the pandemic could well
continue/re-emerge for some time creating the potential for new virus
strains to be linked to decreased or increased case severity and/or
fatality, and potentially affect the efficacy of vaccines and
countermeasures. GISAID offers clade/lineage and variant information to
facilitate genotype-phenotype analyses. Gaining experience in controlled
data collection increases our preparedness for future ‘Disease X’
outbreaks or pandemics, and enables better support of research
work for other infectious diseases such as influenza and respiratory
syncytial virus.
Acknowledgments
ST was supported by a grant awarded to Timothy Barkham and Swaine Chen
by the Temasek Foundation and by the Genome Institute of Singapore. ST
and SMS are supported by the Agency for Science, Technology and Research
(A*STAR). AP’s work on the automated meta-data curation tool is supported
by Institut Pasteur with feedback from its EpiCoV™ data curation team
aiding GISAID. CSIRO is supported by a grant awarded
to SSV by the Coalition for Epidemic Preparedness Innovations (CEPI).
Competing Interests
The authors declare that there are no competing interests.
Author Contribution
DCB, SSV and DPH conceived the paper. ST and AP structured the data. AM,
LOWW and JY conducted the analysis. DCB, SM, KE, DPH and SSV wrote the
paper. All authors finalized the document.
Data Availability
Not applicable
Ethical Statement
Not applicable
References
Australian National COVID-19 Clinical Evidence Taskforce. (n.d.).
Retrieved May 12, 2020, from https://covid19evidence.net.au/
Banerjee, S., Dhar, S., Bhattacharjee, S., & Bhattacharjee, P. (2020).
Decoding the lethal effect of SARS-CoV-2 (novel coronavirus) strains
from global perspective: molecular pathogenesis and evolutionary
divergence. BioRxiv. doi:10.1101/2020.04.06.027854
Bauer, D. C., Tay, A. P., Wilson, L. O. W., Reti, D., Hosking, C.,
McAuley, A. J., … Vasan, S. S. (2020). Supporting pandemic
response using genomics and bioinformatics: a case study on the emergent
SARS-CoV-2 outbreak. Transboundary and Emerging Diseases.
Becerra-Flores, M., & Cardozo, T. (2020). SARS-CoV-2 viral spike G614
mutation exhibits higher case fatality rate. International Journal
of Clinical Practice. doi:10.1111/ijcp.13525
Chu, H., Chan, J. F.-W., Yuen, T. T.-T., Shuai, H., Yuan, S., Wang, Y.,
… Yuen, K.-Y. (2020). Comparative tropism, replication kinetics,
and cell damage profiling of SARS-CoV-2 and SARS-CoV with implications
for clinical manifestations, transmissibility, and laboratory studies of
COVID-19: an observational study. The Lancet Microbe.
doi:10.1016/S2666-5247(20)30004-5
Dyer, C. (2020). Covid-19: Rules on sharing confidential patient
information are relaxed in England. BMJ (Clinical Research Ed.), 369,
m1378. doi:10.1136/bmj.m1378
Elbe, S., & Buckland-Merrett, G. (2017). Data, disease and diplomacy:
GISAID’s innovative contribution to global health. Global
Challenges, 1(1), 33–46. doi:10.1002/gch2.1018
Harris, P. A., Taylor, R., Minor, B. L., Elliott, V., Fernandez, M.,
O’Neal, L., … REDCap Consortium. (2019). The REDCap consortium:
Building an international community of software platform partners.
Journal of Biomedical Informatics, 95, 103208.
doi:10.1016/j.jbi.2019.103208
Korber, B., Fischer, W., Gnanakaran, S. G., Yoon, H., Theiler, J.,
Abfalterer, W., … Sheffield COVID-19 Genomics Group. (2020).
Spike mutation pipeline reveals the emergence of a more transmissible
form of SARS-CoV-2. BioRxiv. doi:10.1101/2020.04.29.069054
Massive coronavirus sequencing efforts urgently need patient data -
Nature India. (n.d.). Retrieved May 27, 2020, from
https://www.natureasia.com/en/nindia/article/10.1038/nindia.2020.75
Mehrotra, K. (2020, May 27). ‘Unassigned’ coronavirus cases near 3,000,
rise as curbs on movement lifted. The Indian Express.
Menni, C., Valdes, A. M., Freidin, M. B., Sudre, C. H., Nguyen, L. H.,
Drew, D. A., … Spector, T. D. (2020). Real-time tracking of
self-reported symptoms to predict potential COVID-19. Nature
Medicine. doi:10.1038/s41591-020-0916-2
Metke-Jimenez, A., & Hansen, D. (2019). FHIRCap: Transforming REDCap
forms into FHIR resources. AMIA Joint Summits on Translational
Science Proceedings AMIA Summit on Translational Science, 2019,
54–63.
Metke-Jimenez, A., Steel, J., Hansen, D., & Lawley, M. (2018).
Ontoserver: a syndicated terminology server. Journal of Biomedical
Semantics, 9(1), 24. doi:10.1186/s13326-018-0191-z
Shrimp browser citable link for COVID-19 symptoms. (n.d.). Retrieved May
12, 2020, from
https://ontoserver.csiro.au/shrimp/vs.html?system=undefined&valueSetUri=http%3A%2F%2Fgenomics.ontoserver.csiro.au%2Ffhir%2Fcovid19%2FValueSet%2FCovid19SymptomsValueSet&valueSetId=Covid19SymptomsValueSet&fhir=https://r4.ontoserver.csiro.au/fhir
Su, Y., Anderson, D., Young, B., Zhu, F., Linster, M., Kalimuddin, S.,
… Smith, G. (2020). Discovery of a 382-nt deletion during the
early evolution of SARS-CoV-2. BioRxiv.
doi:10.1101/2020.03.11.987222
The COVID-19 Host Genetics Initiative. (2020). The COVID-19 Host
Genetics Initiative, a global initiative to elucidate the role of host
genetic factors in susceptibility and severity of the SARS-CoV-2 virus
pandemic. European Journal of Human Genetics.
doi:10.1038/s41431-020-0636-6
Worldometers coronavirus. (n.d.). Retrieved April 28, 2020, from
https://www.worldometers.info/coronavirus/
Yao, H., Lu, X., Chen, Q., Xu, K., Chen, Y., Cheng, L., … Li, L.
(2020). Patient-derived mutations impact pathogenicity of SARS-CoV-2.
MedRxiv. doi:10.1101/2020.04.14.20060160
Zhang, X., Tan, Y., Ling, Y., Lu, G., Liu, F., Yi, Z., … Lu, H.
(2020). Viral and host factors related to the clinical outcome of
COVID-19. Nature. doi:10.1038/s41586-020-2355-0