COVID-19 preprint research to date
Although the volume of research responding to the epidemic is too large
to discuss thoroughly here, several key findings regarding the origins and
spread of SARS-CoV-2 have been rapidly disseminated across the global
community as preprints.
Regarding zoonotic origins, preprint research posted less than a month
after initial case notifications demonstrated the genome sequence of
SARS-CoV-2 to be most similar to those of several bat coronaviruses,
with a 96\% sequence identity match to bat SARS-like CoV RaTG13
\citep{wu_complete_2020,zhou_pneumonia_2020}. However, SARS-CoV-2 appears to differ distinctly from
these bat coronaviruses in the receptor binding domain of its surface
spike protein. These differences result in efficient binding to the
human ACE2 cell receptor \citep{hoffmann_novel_2020}, likely a key determinant of the
efficiency of human-to-human spread. The receptor binding domain was
instead shown to have strong similarity to that of a coronavirus
isolated from diseased Malayan pangolins (\textit{Manis javanica})
\citep{liu_are_2020,wahba_identification_2020,wong_evidence_2020,xiao_isolation_2020}. Elsewhere, exceptionally timely preprint research
characterised and shared this spike protein’s molecular structure
\citep{wrapp_cryo-em_2020}.
Rapid epidemiological modelling efforts meant that multiple estimates of
the basic reproductive number ($R_0$) for SARS-CoV-2 were
also quickly disseminated through preprints, with consensus
around an $R_0$ value of $\sim$2.9 \citep{park_reconciling_2020}.
These preprints covered a wide variety of estimation methods and fitted
data, from deterministic compartmental models to stochastic simulations,
allowing a systematic review to be conducted as early as mid-February
\citep*{majumder_early_2020}.
One consistent feature of early research was the discrepancy between the
various initial names given to both the virus (e.g. 2019-nCoV) and its
resulting disease (e.g. NCIP; novel coronavirus-infected pneumonia). In
response, the supporting case for the nomenclature and classification of
SARS-CoV-2 from the ICTV Coronavirus Study Group consensus was itself
made available ahead of publication \citep{gorbalenya_severe_2020}, in order to drive
standardisation in the vast forthcoming literature.
Characterising the preprint response to the COVID-19 epidemic
Since 2016, bioRxiv has become the dominant preprint repository for the
life sciences
\citep{asapbio_biology_2019,sever_biorxiv_2019}, though multiple other generalist and
specialist preprint repositories covering life sciences are also
widely used. Search queries were therefore conducted within the
English-language arXiv, bioRxiv, and medRxiv repositories by matching
query text in titles and abstracts. Searches were conducted for
SARS-CoV-2 as well as additional pathogens for comparison (Supplementary
Table), aggregating results across all relevant search terms, including
disease names (e.g. “COVID-19” for SARS-CoV-2), commonly used names of
higher taxonomy (e.g. “coronavirus” for SARS-CoV-2), and acronyms
where established (e.g. “ZIKV” for Zika virus). Metadata extracted
included date of posting (defined as the initial deposition date for
preprints with multiple versions) and subject area categorisation
(selected by the uploading author). Each preprint server was accessed
programmatically: arXiv via the arXiv API
\citep{arxiv_arxiv_2019}, bioRxiv via the
Rxivist API
\citep*{abdill_tracking_2019}, and medRxiv via the ‘medrxivr’ package v0.0.1.9
\citep*{mcguinness_medrxivr_2020}. The small number of manuscripts that have been withdrawn were
not excluded, as the aim here was to quantify trends in preprint posting
rather than endpoints. Cumulative frequency curves were then plotted for
aggregated preprint totals for each pathogen. Rates of preprint posting
were estimated for each pathogen as slope parameters from fitted simple
linear regressions with ordinary least-squares estimation. All preprint
server interfacing and data manipulation were carried out using R v3.6.1,
and all supporting code is available at
\url{https://github.com/lbrierley/epi_preprint}.
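As a rough illustration of the workflow described above, the following Python sketch matches search terms against titles and abstracts, builds a cumulative count of matching preprints by first posting date, and estimates a posting rate as the ordinary least-squares slope of cumulative count against elapsed days. It is not the R code used in this study: the record fields (`title`, `abstract`, `posted`), the helper names, the toy records, and the choice to regress cumulative totals on elapsed days are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def matches_any(record, terms):
    """True if any search term occurs (case-insensitively) in title or abstract."""
    text = (record["title"] + " " + record["abstract"]).lower()
    return any(term.lower() in text for term in terms)


def cumulative_counts(dates):
    """Return (sorted distinct days, running total of preprints up to each day)."""
    per_day = Counter(dates)
    days = sorted(per_day)
    totals = list(accumulate(per_day[d] for d in days))
    return days, totals


def ols_slope(x, y):
    """Slope of the simple linear regression of y on x (ordinary least squares)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("need at least two distinct x values to fit a slope")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def posting_rate(records, terms):
    """Estimated preprints per day among records matching any search term."""
    hits = [r for r in records if matches_any(r, terms)]
    # "posted" is assumed to hold the initial deposition date of each preprint.
    days, totals = cumulative_counts([r["posted"] for r in hits])
    elapsed = [(d - days[0]).days for d in days]
    return ols_slope(elapsed, totals)


# Hypothetical toy records, invented for this example only (not study data).
records = [
    {"title": "A model of COVID-19 spread", "abstract": "", "posted": date(2020, 1, 20)},
    {"title": "Spike protein structure", "abstract": "We study the SARS-CoV-2 spike.", "posted": date(2020, 1, 21)},
    {"title": "Coronavirus genomics", "abstract": "", "posted": date(2020, 1, 21)},
    {"title": "Zika virus vectors", "abstract": "ZIKV in mosquitoes.", "posted": date(2020, 1, 22)},
    {"title": "Estimating transmission", "abstract": "covid-19 transmission", "posted": date(2020, 1, 22)},
]
terms = ["COVID-19", "SARS-CoV-2", "coronavirus"]

rate = posting_rate(records, terms)
```

For these toy records, four of the five entries match, the cumulative totals over the three posting days are 1, 3 and 4, and the fitted slope is about 1.5 preprints per day. Plain substring matching is used here for brevity; a real query would need to handle short acronyms that can occur inside unrelated words.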
bioRxiv received the earliest preprint research regarding COVID-19,
including the first preprint deposited on January 20th \citep{chen_mathematical_2020}, 22 days after health authority notification of the initial
cluster of cases in Wuhan (Figure 1a). However, from early February
onwards, medRxiv became the dominant preprint server for COVID-19
research and contained 561 preprints at the date of extraction (25 March 2020;
71.5\% of all preprints identified here) (Figure 1b). As a more
generalised repository focusing on physical and computer sciences, arXiv
contained a smaller number of COVID-19 preprints (Figure 1c). The
majority of COVID-19 preprints were (or were categorised as) population
biology/epidemiological, microbiological, or bioinformatic/genomic
studies (Figure 1). Less frequently observed categories indicated
availability of COVID-19 preprint research intersecting with a wide
range of specialist areas, e.g. ophthalmology within medRxiv \citep{zhou_ophthalmologic_2020} and information science within arXiv \citep{strzelecki_infodemiological_2020}.