Characterising the preprint response to the COVID-19 epidemic
Since 2016, bioRxiv has become the dominant preprint repository for the life sciences \citep{asapbio_biology_2019,sever_biorxiv_2019}, though several other generalist and specialist preprint repositories covering the life sciences are also well used. Search queries were therefore conducted within the English-language arXiv, bioRxiv, and medRxiv repositories by matching query text against titles and abstracts. Searches were conducted for SARS-CoV-2 and, for comparison, for additional pathogens (Supplementary Table). Results were aggregated across all relevant search terms for each pathogen, including disease names (e.g. “COVID-19” for SARS-CoV-2), commonly used names of higher taxa (e.g. “coronavirus” for SARS-CoV-2), and acronyms where established (e.g. “ZIKV” for Zika virus). Metadata extracted included the date of posting (defined as the initial deposition date for preprints with multiple versions) and the subject area category (selected by the uploading author). Each preprint server was accessed programmatically: arXiv via the arXiv API \citep{arxiv_arxiv_2019}, bioRxiv via the Rxivist API \citep*{abdill_tracking_2019}, and medRxiv via the ‘medrxivr’ package v0.0.1.9 \citep*{mcguinness_medrxivr_2020}. The small number of manuscripts that had been withdrawn were not excluded, as the aim here was to quantify trends in preprint posting rather than endpoints. Cumulative frequency curves were then plotted for the aggregated preprint totals for each pathogen. Rates of preprint posting were estimated for each pathogen as the slope parameters of simple linear regressions fitted by ordinary least squares. All preprint server interfacing and data manipulation were carried out in R v3.6.1, and all supporting code is available at https://github.com/lbrierley/epi_preprint.
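To make the rate estimate concrete, the following is a minimal sketch. It assumes, although this is not stated explicitly above, that the regression is of the cumulative preprint count on the number of days elapsed. The notation is introduced here purely for exposition: $t_i$ denotes the $i$-th posting day, $N_p(t_i)$ the cumulative number of preprints for pathogen $p$ posted by day $t_i$, and $n$ the number of days observed. Under that assumption the fitted model is
\[
N_p(t_i) = \alpha_p + \beta_p t_i + \varepsilon_i, \qquad i = 1, \dots, n,
\]
and the ordinary least-squares estimate of the slope is
\[
\hat{\beta}_p = \frac{\sum_{i=1}^{n} (t_i - \bar{t})\,\bigl(N_p(t_i) - \bar{N}_p\bigr)}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\]
where $\bar{t}$ and $\bar{N}_p$ are the sample means of $t_i$ and $N_p(t_i)$ respectively. In this formulation $\hat{\beta}_p$ would be interpreted as the average number of preprints posted per day for pathogen $p$ over the observation window.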
bioRxiv received the earliest preprint research regarding COVID-19, including the first preprint, deposited on January 20th \citep{chen_mathematical_2020}, 22 days after health authority notification of the initial cluster of cases in Wuhan (Figure 1a). However, from early February onwards, medRxiv became the dominant preprint server for COVID-19 research, containing 561 preprints at the date of extraction (25/3/20; 71.5\% of all preprints identified here) (Figure 1b). As a more generalised repository focusing on the physical and computer sciences, arXiv contained a smaller number of COVID-19 preprints (Figure 1c). The majority of COVID-19 preprints were (or were categorised as) population biology/epidemiological, microbiological, or bioinformatic/genomic studies (Figure 1). Less frequently observed categories indicated that COVID-19 preprint research also intersected with a wide range of specialist areas, e.g. ophthalmology within medRxiv \citep{zhou_ophthalmologic_2020} and information science within arXiv \citep{strzelecki_infodemiological_2020}.