Virus extraction, RT-PCR, virus isolation
Viral RNA was extracted from swab samples using the MagMAXTM-96 AI/ND Viral RNA Isolation Kit (Ambion, Austin, TX) following the manufacturer’s procedures. Real-time RT-PCR was performed using previously published procedures, primers, and probes (Spackman et al., 2002) designed to detect the IAV matrix gene. RT-PCR assays utilized reagents provided in the Qiagen OneStep® RT-PCR kit (Qiagen, Germantown, MD). Following RT-PCR screening, all samples with Ct values <45 were attempted to be grown in embryonated chicken egg cultures (Woolcock, 2008), the harvested allantoic fluid were tested again by RT-PCR and samples with Ct values <22 were selected for sequencing. This cut-off was selected based on our previous experiences, as samples with higher Ct-values typically yield poor quality sequence results. Egg-grown virus isolates were sequenced using multiple standard methods including Sanger (Sanger, Nicklen, & Coulson, 1977), Roche 454 (Margulies et al., 2005), and Illumina HiSeq 2000 and MiSeq (Illumina, San Diego, CA) sequencing (Dusek et al., 2014; Guan et al., 2019; Hall et al., 2014).
Datasets for phylodynamic analyses
Global dataset: The polymerase basic protein 2 (PB2) segment was selected as the basis for phylodynamic analysis, given (a) it is the largest internal segment of the IAV genome (maximizing the number of nucleotides in the analysis (>2000 nts)), (b) it is relatively fast evolving for an internal gene with a mean of 3.15 x 10-3 (95% HPD 1.94-4.44) substitutions per site per year (Rubing Chen & Holmes, 2006), and (c) enables the investigation of transmission dynamics without targeting a specific subtype. The temporal signal of the PB2 gene and clock-like evolution of both global and local datasets were evaluated (Supplementary figure 1). All globally available avian and marine mammal IAV PB2 genes sequenced between 2009 and 2019, excluding those from Iceland, were downloaded from the National Center for Biotechnology Information Influenza Virus Resource database (NCBI IVR) (Bao et al., 2008) on February 12, 2020, resulting in 13,434 sequences. Iceland-derived sequences were downsampled separately, as described below. Duplicate sequences (based on collection date, location, and nucleotide content) and sequences with less than 75% unambiguous bases were removed, and all vaccine derivative and laboratory-synthesized recombinant sequences were excluded. Sequences in the dataset were only included if isolation dates, location, and host species were available, resulting in 7,210 remaining sequences. The subsequent downsampling strategy aimed to reduce the number of sequence taxa for computational efficiency and to mitigate sampling bias while maintaining genetic diversity in the dataset.
Downsampling of global taxa outside of Iceland (i.e. the ‘outgroup’): Four variables were considered important for explaining genetic diversity in the global outgroup IAV sequence dataset: geographic region, host taxa, sampling year, and hemagglutinin (HA) subtype. Five geographic region categories included North America, Europe, Asia, Africa, and South America (Australia and Antarctica were removed due to insufficient sequence counts). Fourteen HA subtype categories included H1, H2, H3, H4, H6, H8, H10, H11, H12, H13, H14, H15, H16, and pooled H5/7/9. H5, H7, and H9 were combined to reduce bias, as these were over-represented in the global dataset. Four host categories included Anseriformes, Charadriiformes, Galliformes, and Other, which comprised all other avian taxa and marine mammals.
To inform the downsampling strategy for the outgroup and evaluate if any of the four variables were correlated, a multiple correspondence analysis (MCA) was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)). The MCA uses categorical data as input, which for this study included the sampling metadata associated with each sequence (region, host taxa, year, and HA subtype). Through representation of the variables in two-dimensional Euclidean space, significant clustering of HA subtypes with host taxa was detected (Supplementary figure 2), indicative of host-specific subtypes that are a well-known feature of influenza (Olson et al., 2014). These findings confirmed by previously published data on species-specificity of HA subtypes (Byrd-Leotis, Cummings, & Steinhauer, 2017; Long, Mistry, Haslam, & Barclay, 2019; Verhagen et al., 2015) led us to downsample the dataset stratifying taxa by two non-overlapping variables: geographic region and HA subtype. Downsampling of the global data resulted in 21-75 taxa per five geographic region categories and 6-30 taxa per 14 HA subtype categories, resulting in a total of 301 sequences in the global outgroup, with relative evenness across sampling years. This step was performed to mitigate sampling bias resulting in over-representation of species or viral strains, while accounting for genetic diversity in the dataset.
Downsampling of Iceland-derived taxa (i.e. the ‘ingroup’): Next, virus sequences from Iceland (including 35 downloaded from NCBI IVR and 58 novel viruses isolated and recently sequenced by our group (n=93)) (Dusek et al., 2022) were downsampled by stratifying taxa by HA subtype (generating 1-15 sequences per 14 HA subtype categories), resulting in 63 sequences. These 63 sequences were used for global and local discrete trait analyses and the ingroup dataset reflected the underlying composition of host-specific subtypes present in this localized system. To assist with rooting and time-calibration of the tree, historical avian sequences from NCBI IVR were downloaded for the years 1979-2008. These were downsampled by year to ensure one sequence per year, resulting in 30 historic sequences. The total downsampled dataset, including the outgroup (n=301), ingroup (n=63), and historic sequences (n=30) resulted in a total of 394 sequences (Gass J.D., 2021).
Europe-Iceland-North America Datasets: Following analyses which identified the most significant geographic regions acting as sources of IAVs to Iceland, a second dataset was constructed at a restricted scale to Europe, Iceland, and North America between 2009 and 2019. First, the cleaned global dataset described above (n=7245) was downsampled to include significant source regions of North America (n=3222) and Europe (n=407), totaling 3629 sequences. To identify at lower spatial resolution the source/sink locations relevant to Iceland, a K-means cluster analysis was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)) using latitude/longitude coordinates for each of the 3629 sequences (obtained by extracting sampling location from the strain name of each sequence and searching in www.geonames.org). A total of 20 intraregional clusters resulted. Identified clusters with <50 sequences were combined with geographically proximal clusters to increase evenness of within cluster sequence counts for discrete traits analyses, resulting in 13 intraregional cluster locations within North America, Iceland, and the rest of Europe (Supplementary figure 3).
Viral sequences were then downsampled from 3629 to 743 taxa stratifying by intraregional cluster groupings and HA subtype. Two datasets were formed, both of which included the 743 downsampled sequences (557 from North America, 229 from Europe excluding Iceland), 30 historic sequences (same as the global analysis), and: (i) for discrete trait phylogeographic and phylodynamic analyses (non-geographic analyses of host transmission and subtype reassortment patterns using discrete diffusion models): 63 downsampled Iceland-derived sequences, totaling 836 sequences, and (ii) for continuous trait phylogeographic analysis, all 93 Iceland-derived sequences were included, totaling 866 sequences (Gass J.D., 2021). For purposes of clarity, we use the term ‘phylogeographic’ to refer to analyses that are geographic in nature and ‘phylodynamic’ to refer to analyses that focus on viral diffusion between non-geographic traits, namely transmission between species and reassortment of HA subtypes. The two datasets were formed because (a) heterogeneity in sampling among locations and host taxa can bias results of discrete trait analyses, therefore downsampling to ensure relative homeogeneity of trait groupings (while striving to preserve host and pathogen population diversity across space and time) is required, and (b) continuous trait analyses are robust against heterogeneity in sampling (Baele, Suchard, Rambaut, & Lemey, 2017; Lemey, Rambaut, Welch, & Suchard, 2010), thus the full Iceland dataset was included for continuous analyses (Gass J.D., 2021). Multiple sequence alignments of PB2 sequences were performed using MUSCLE in Geneious Prime 2020.01.02 (https://www.geneious.com) and trimmed to the open reading frame. Maximum-likelihood phylogenies of PB2 segments in the downsampled datasets were reconstructed using RAxML v8.2.12 (Stamatakis, 2006) and temporal signal was investigated using TempEst v1.5.3 (Rambaut, Lam, Max Carvalho, & Pybus, 2016) (Supplementary figure 1).