Characterising the preprint response to the COVID-19 epidemic
Since 2016, bioRxiv has become the dominant preprint repository for the
life sciences
\citep{asapbio_biology_2019,sever_biorxiv_2019}, though multiple other generalist and
specialist preprint repositories covering the life sciences are also
widely used. Search queries were therefore conducted within the
English-language arXiv, bioRxiv, and medRxiv repositories by matching
query text in titles and abstracts. Searches were conducted for
SARS-CoV-2 as well as additional pathogens for comparison (Supplementary
Table), aggregating results across all relevant search terms, including
disease names (e.g. “COVID-19” for SARS-CoV-2), commonly used names of
higher taxonomy (e.g. “coronavirus” for SARS-CoV-2), and acronyms
where established (e.g. “ZIKV” for Zika virus). Metadata extracted
included date of posting (defined as the initial deposition date for
preprints with multiple versions) and subject area categorisation
(selected by the uploading author). Each preprint server was accessed
programmatically: arXiv via the arXiv API
\citep{arxiv_arxiv_2019}, bioRxiv via the
Rxivist API
\citep*{abdill_tracking_2019}, and medRxiv via the ‘medrxivr’ package v0.0.1.9
\citep*{mcguinness_medrxivr_2020}. The small number of manuscripts that had been withdrawn were
retained, as the aim here was to quantify trends in preprint posting
rather than the eventual outcomes of individual preprints. Cumulative frequency curves were then plotted for
aggregated preprint totals for each pathogen. Rates of preprint posting
were estimated for each pathogen as slope parameters from fitted simple
linear regressions with ordinary least-squares estimation. All preprint
server interfacing and data manipulation were carried out using R v3.6.1,
and all supporting code is available at
https://github.com/lbrierley/epi_preprint.
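
As an illustration only, the steps described above (term matching in titles and abstracts, cumulative counts by posting date, and a posting rate taken as the slope of a simple linear regression) can be sketched as follows. This is a minimal sketch in Python, not the R code used for the analysis; the search terms, toy records, and function names are hypothetical assumptions, and the actual search terms are those listed in the Supplementary Table.

```python
from collections import Counter
from datetime import date

# Hypothetical search terms for one pathogen (assumption, for illustration).
SEARCH_TERMS = ["SARS-CoV-2", "COVID-19", "coronavirus"]


def matches_any_term(title, abstract, terms):
    """True if any term occurs, case-insensitively, in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return any(term.lower() in text for term in terms)


def cumulative_counts(posting_dates):
    """Return (days since first posting, cumulative total) pairs, in date order."""
    per_day = Counter(posting_dates)
    first = min(per_day)
    total = 0
    curve = []
    for day in sorted(per_day):
        total += per_day[day]
        curve.append(((day - first).days, total))
    return curve


def ols_slope(points):
    """Slope b of y = a + b*x fitted by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Toy records (title, abstract, date of first version); invented data.
records = [
    ("Modelling COVID-19 spread", "", date(2020, 1, 20)),
    ("A novel coronavirus genome", "", date(2020, 1, 21)),
    ("Transmission estimates", "Early SARS-CoV-2 data", date(2020, 1, 21)),
    ("Zika virus dynamics", "ZIKV in Brazil", date(2020, 1, 21)),
    ("Clinical features", "COVID-19 patients", date(2020, 1, 22)),
    ("Coronavirus phylogeny", "", date(2020, 1, 22)),
    ("Travel restrictions", "covid-19 importation risk", date(2020, 1, 22)),
]

matched_dates = [d for title, abstract, d in records
                 if matches_any_term(title, abstract, SEARCH_TERMS)]
curve = cumulative_counts(matched_dates)
slope = ols_slope(curve)  # preprints per day over the toy window
print(curve, slope)
```

In this sketch the regression is fitted to the cumulative totals against days elapsed since the first matching preprint, so the slope is read as preprints posted per day. Whether the original analysis used the same time origin and granularity is not stated in the text and is assumed here.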

bioRxiv received the earliest preprint research regarding COVID-19,
including the first preprint deposited on January 20th \citep{chen_mathematical_2020}, 22 days after health authority notification of the initial
cluster of cases in Wuhan (Figure 1a). However, from early February
onwards, medRxiv became the dominant preprint server for COVID-19
research and contained 561 preprints as of the date of data extraction (25 March 2020;
71.5\% of all preprints identified here) (Figure 1b). As a more
generalised repository focusing on physical and computer sciences, arXiv
contained a smaller number of COVID-19 preprints (Figure 1c). The
majority of COVID-19 preprints were (or were categorised as) population
biology/epidemiological, microbiological, or bioinformatic/genomic
studies (Figure 1). Less frequent categories showed that
COVID-19 preprint research also intersected with a wide
range of specialist areas, e.g. ophthalmology within medRxiv \citep{zhou_ophthalmologic_2020} and information science within arXiv \citep{strzelecki_infodemiological_2020}.